Project work - Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)¶
1. Theory 2. Data Understanding 3. Data Preparation 4. Visualization 5. Dash 6. NLP/spaCy
Course: Data Analytics Lecturer: Axel Wemmel Participant: Natalia Mokeeva Submission date: 18.09.2025
Contents - Outline¶
- Introduction
- Theory
- Preliminaries
- Data Understanding and Data Preparation - New York
- Data Understanding and Data Preparation - Rome
- EDA model - New York
- Visualization of the listings (Plotly Express)
- Geo-visualization of the listings: New York
- Dash with chart and slider: New York
- Dashboard: showing components side by side: New York
- Geo-visualization of the listings: Rome
- Dash with chart and slider: Rome
- Dashboard: showing components side by side: Rome
- Donut charts: license status New York vs. Rome
- Natural Language Processing (spaCy)
Introduction¶
The dataset to be analyzed is "Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)".
Data details:
- The data uses publicly available information collected from the Airbnb website, including the availability calendar for 365 days into the future as well as the reviews for each listing.
- No "private" information is used. Names, photos, listings, and review details are all displayed publicly on the Airbnb website.
- Location information for listings is anonymized by Airbnb. In practice this means that a listing's location as shown on the map or in the data may be shifted by up to 150 meters from its actual address.
- The neighbourhood name of each listing is determined by matching the listing's geographic coordinates against the city's own neighbourhood definitions. The neighbourhood names assigned by Airbnb are not used because of their inaccuracies.
The files "listings.csv" for New York City and "listings.csv" for Rome, Lazio contain information from the Airbnb website from 2025. After a closer look at the data and the meaning behind it, the individual phases of descriptive statistics and of the EDA model are explained in more detail.
Subsequently, the "Data Understanding" and "Data Preparation" phases are carried out and explained in practice. Cleaning the data (including the columns "property_type", "neighbourhood_group_cleansed", and "price") enables a more focused analysis, in particular the interactive dashboard with chart and slider, "Airbnb NYC - year filter: availability vs. price".
A few more words about the dataset:
- Description by the creator:
Inside Airbnb is a mission-driven project. It collects data on Airbnb to analyze its impact on residential neighbourhoods. _About – Inside Airbnb_
Info:
- The file types, formats, etc. listed here will be adjusted later on for easier processing.
- The dataset can be downloaded here:
Data dictionary:
| Section | Description | Columns (fields) | Units / notes (per data dictionary) |
|---|---|---|---|
| Time level | Snapshot of the inventory at scrape time. | last_scraped | Datetime (UTC) - the time at which the listing was "scraped". |
| Units of analysis (granularity) | Levels at which data is evaluated/aggregated. | Listing: id • Host: host_id • Neighbourhood: neighbourhood_cleansed, neighbourhood_group_cleansed • Time: host_since, first_review, last_review | id (integer) = unique listing ID • host_id (integer) = host ID • host_since (date) = account creation date • first_review/last_review (date) = first/last review. |
| Key measures (market/quality) | Central metrics per listing. | Price: price • Availability: availability_30, availability_60, availability_90, availability_365, has_availability • Reviews: number_of_reviews, reviews_per_month, review_scores_* • Capacity/amenities: accommodates, bedrooms, beds, bathrooms, bathrooms_text • Booking: instant_bookable | price (currency) = daily price in local currency; the $ in the export is a technical artifact (ignore it). • availability_x (integer) = available nights within the next x days; "not available" can mean booked or blocked. • has_availability (boolean) = t/f. • number_of_reviews (integer). • reviews_per_month = average reviews per month over the listing's lifetime (cf. pseudocode in the dictionary). • review_scores_* (numeric, Airbnb scores). • accommodates (integer) = guests. • bathrooms (numeric), bathrooms_text (string, textual value, e.g. "1.5 shared baths"). • instant_bookable (boolean) = t/f, whether automatic booking is possible. |
| Listing level - core metrics | Attributes per listing (display). | property_type, room_type, accommodates, bedrooms, beds, bathrooms/bathrooms_text, price, has_availability, availability_30/60/90/365, number_of_reviews, reviews_per_month, review_scores_*, instant_bookable, host_is_superhost, host_listings_count, host_since, location: neighbourhood_cleansed, neighbourhood_group_cleansed, latitude, longitude | Types per dictionary: property_type, room_type, neighbourhood_* (text); latitude/longitude (numeric); host_is_superhost (boolean, t/f); host_listings_count (integer). |
| Derived metrics (useful, not original) | Computed from raw fields (for analyses). | occupied_nights_365, revpar_proxy, price_per_guest, host_tenure_days | Occupied nights (proxy): occupied_nights_365 = 365 − availability_365 (nights/year) • RevPAR (proxy): estimated_revenue_l365d / 365 • Price per guest: price / accommodates • Host tenure (days): last_scraped − host_since. |
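The derived metrics in the last table row can be computed directly from the raw fields. A minimal pandas sketch with toy values (the two rows are invented for illustration; the formulas follow the data dictionary above):

```python
import pandas as pd

# Toy frame with the raw fields the derived metrics need (invented values)
df = pd.DataFrame({
    "availability_365": [310, 0],
    "estimated_revenue_l365d": [18510.0, 0.0],
    "price": [120.0, 80.0],
    "accommodates": [4, 2],
    "last_scraped": pd.to_datetime(["2025-06-17", "2025-06-17"]),
    "host_since": pd.to_datetime(["2016-12-16", "2024-06-17"]),
})

# Occupied nights (proxy): nights per year that are not available
df["occupied_nights_365"] = 365 - df["availability_365"]
# RevPAR proxy: estimated yearly revenue spread over 365 days
df["revpar_proxy"] = df["estimated_revenue_l365d"] / 365
# Price per guest
df["price_per_guest"] = df["price"] / df["accommodates"]
# Host tenure in days
df["host_tenure_days"] = (df["last_scraped"] - df["host_since"]).dt.days
```

The same four lines carry over to the real `data_NY`/Rome frames once `price` has been converted to a numeric type and the date columns to `datetime`.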
Theory¶
Exploratory data analysis¶
- Exploratory data analysis (EDA), also called exploratory statistics, is a subfield of statistics. It analyzes data for which few or no relationships are known in advance.
- The goal of EDA is to gain an overview of the data at hand in order to assess and evaluate it better. Only in the course of the analysis do regularities, dependencies, or notable relationships between seemingly unrelated data become apparent.
- Once enough patterns and relationships have been discovered to make a statement, outliers in the data are then searched for specifically. Each outlier is analyzed in turn to decide whether it can be removed from the data pool or whether the analysis needs to be refined further.
- Missing values are analyzed as well; they point to possible problems during data collection.
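These EDA steps map onto a few standard pandas calls. A generic sketch with an invented toy column (not the project data); the IQR rule used for outlier screening is one common choice, not prescribed by the text:

```python
import pandas as pd

# Toy column: four plausible prices, one extreme value, one missing value
df = pd.DataFrame({"price": [100.0, 120.0, 95.0, 5000.0, None]})

# Overview: summary statistics reveal scale, spread, and suspicious extremes
summary = df["price"].describe()

# Outlier screening via the interquartile-range (IQR) rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

# Missing values hint at problems during data collection
n_missing = df["price"].isna().sum()
```

Whether a flagged outlier is removed or kept is then a case-by-case decision, as described above.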
Data Mining¶
What is data mining?
- Data mining is the process of extracting useful information from a collection of data, often from a data warehouse or a set of linked datasets.
- Data mining tools include powerful statistical, mathematical, and analytical functions.
- Their primary task is to analyze large volumes of data in order to detect trends, patterns, and relationships that enable informed decision-making and planning.
General procedure:
- Understand the problem
- Collect the data
- Prepare and understand the data
- Train the users
Data¶
Data cleaning¶
- Data cleaning, also called data cleansing or data scrubbing, is the process of correcting incorrect, incomplete, duplicate, or otherwise flawed data in a dataset.
- It is about identifying data errors and then changing, updating, or removing data to correct them. Data cleaning improves data quality and helps provide more accurate, consistent, and reliable information for decision-making in an organization.
Steps:
- Inspection and profiling
- Cleaning
- Verification
- Reporting
The cleaned data can then be passed into the remaining phases of data preparation, starting with data structuring and data transformation, to prepare it for analysis.
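Applied to listings-like data, the four steps can be sketched as follows (a minimal sketch with invented rows; the `$`/comma format of `price` mirrors the Inside Airbnb export described in the data dictionary above):

```python
import pandas as pd

# Invented raw data: one exact duplicate, one missing price, string-typed prices
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "price": ["$100.00", "$100.00", "$1,250.00", None],
})

# 1) Inspection/profiling: count duplicates and missing values first
n_dupes = raw.duplicated().sum()
n_missing = raw["price"].isna().sum()

# 2) Cleaning: drop exact duplicates, convert the price string to a float
clean = raw.drop_duplicates().copy()
clean["price"] = (
    clean["price"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

# 3) Verification: the cleaned column should now be numeric
assert clean["price"].dtype == float
```

Step 4 (reporting) would then summarize what was changed, e.g. how many duplicates were dropped and how many values remain missing.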
Preliminaries¶
Before the dataset is analyzed and prepared, this chapter provides some preliminary information that makes the overall picture easier for the reader to follow.
Libraries used¶
Besides the standard libraries
`pandas`, `numpy`, `matplotlib`, `seaborn`, `spacy`,
the following libraries are also used:
`re` (regular expression operations: extracting string parts from columns), `dash` and `plotly.express` (visualizing data in dashboards), and `IPython.display`.
The packages were installed via pip, e.g.:
`pip install spacy`, `pip install statsmodels`, and `python -m spacy download en_core_web_sm` (the spaCy default model).
Logic (files, backups)¶
The given dataset "Airbnb: Detailed Listings for New York City, New York, United States and Rome, Lazio, Italy (2025)" is unclean in several respects: there are null values and spelling mistakes, some columns contain no values at all, and some data formats need to be converted.
To avoid losing relevant data during processing, and also to keep the Jupyter notebook performant, backups were saved after the relevant processing steps. The following backup files exist (in CSV format):
data_clean
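The backup logic itself is a simple CSV round trip. A minimal sketch (the file name `data_clean.csv` is assumed from the backup file listed above; the toy frame is invented):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "price": [100.0, 250.0]})

# Save a backup after a relevant processing step ...
df.to_csv("data_clean.csv", index=False)

# ... and restore it later without re-running the earlier steps
restored = pd.read_csv("data_clean.csv")
```

Writing with `index=False` keeps the row index out of the file, so reading the backup back yields the same columns as the original frame.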
Data Understanding and Data Preparation - New York¶
In the following, the CRISP-DM phase of Data Understanding is carried out in practice: we get an insight into the structure and layout of the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.formula.api as smf
import seaborn as sns
import spacy
import os, re, webbrowser
from dash import Dash, html, dash_table, dcc, callback, Output, Input
from IPython.display import HTML, IFrame, display
from plotly.subplots import make_subplots
from spacy import displacy
First look at the data¶
To start, a first impression of the data is gained:
- What does the dataset look like as a whole, which values are at the top, which at the bottom, which data types predominate?
- The individual columns are also examined more closely, split into numeric and non-numeric columns, among others.
- The numeric columns are then put under the microscope: how many are there, and what do the arithmetic mean, the standard deviation, the quantiles, and the maximum values look like?
# New York: read the dataset + first look at the data
file_NY = "data/NY/listings.csv"
data_NY = pd.read_csv(file_NY, low_memory=False)
# Look at the head of the data
data_NY.head(2)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | https://www.airbnb.com/rooms/2539 | 20250617032754 | 2025-06-17 | city scrape | Superfast Wi-Fi. Clean & quiet home by the park | Bright, serene room in a renovated apartment h... | Close to Prospect Park and Historic Ditmas Park | https://a0.muscache.com/pictures/hosting/Hosti... | 2787 | ... | 5.0 | 4.75 | 4.88 | NaN | f | 6 | 1 | 5 | 0 | 0.08 |
| 1 | 2595 | https://www.airbnb.com/rooms/2595 | 20250617032754 | 2025-06-17 | city scrape | Skylit Midtown Manhattan Spacious Studio | Beautiful, spacious skylit studio in the heart... | Centrally located in the heart of Manhattan ju... | https://a0.muscache.com/pictures/hosting/Hosti... | 2845 | ... | 4.8 | 4.81 | 4.40 | NaN | f | 3 | 3 | 0 | 0 | 0.26 |
2 rows × 79 columns
# Check DataFrame
type(data_NY)
pandas.core.frame.DataFrame
# Look at the tail of the data
data_NY.tail(2)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36320 | 1444497972042807274 | https://www.airbnb.com/rooms/1444497972042807274 | 20250617032754 | 2025-06-17 | city scrape | Steps from the Empire State, Iconic Views, 2 Beds | Welcome to your dream Manhattan escape — a stu... | NaN | https://a0.muscache.com/pictures/hosting/Hosti... | 677540330 | ... | NaN | NaN | NaN | NaN | f | 2 | 2 | 0 | 0 | NaN |
| 36321 | 1444533606768940303 | https://www.airbnb.com/rooms/1444533606768940303 | 20250617032754 | 2025-06-17 | city scrape | 2BR/Brooklyn: Private Backyard | Spacious 2BR Brooklyn home in beautiful Bed-St... | NaN | https://a0.muscache.com/pictures/miso/Hosting-... | 10931359 | ... | NaN | NaN | NaN | NaN | f | 1 | 1 | 0 | 0 | NaN |
2 rows × 79 columns
# Overall shape of the dataset
data_NY.shape
(36322, 79)
# New York: more detail on columns and data types
data_NY.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36322 entries, 0 to 36321 Data columns (total 79 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 36322 non-null int64 1 listing_url 36322 non-null object 2 scrape_id 36322 non-null int64 3 last_scraped 36322 non-null object 4 source 36322 non-null object 5 name 36320 non-null object 6 description 35374 non-null object 7 neighborhood_overview 19084 non-null object 8 picture_url 36322 non-null object 9 host_id 36322 non-null int64 10 host_url 36322 non-null object 11 host_name 36307 non-null object 12 host_since 36307 non-null object 13 host_location 28913 non-null object 14 host_about 20956 non-null object 15 host_response_time 21550 non-null object 16 host_response_rate 21550 non-null object 17 host_acceptance_rate 21720 non-null object 18 host_is_superhost 35830 non-null object 19 host_thumbnail_url 36307 non-null object 20 host_picture_url 36307 non-null object 21 host_neighbourhood 29034 non-null object 22 host_listings_count 36307 non-null float64 23 host_total_listings_count 36307 non-null float64 24 host_verifications 36307 non-null object 25 host_has_profile_pic 36307 non-null object 26 host_identity_verified 36307 non-null object 27 neighbourhood 19085 non-null object 28 neighbourhood_cleansed 36322 non-null object 29 neighbourhood_group_cleansed 36322 non-null object 30 latitude 36322 non-null float64 31 longitude 36322 non-null float64 32 property_type 36322 non-null object 33 room_type 36322 non-null object 34 accommodates 36322 non-null int64 35 bathrooms 21682 non-null float64 36 bathrooms_text 36286 non-null object 37 bedrooms 30274 non-null float64 38 beds 21620 non-null float64 39 amenities 36322 non-null object 40 price 21459 non-null object 41 minimum_nights 36322 non-null int64 42 maximum_nights 36322 non-null int64 43 minimum_minimum_nights 36322 non-null int64 44 maximum_minimum_nights 36322 non-null int64 45 minimum_maximum_nights 36322 non-null int64 46 
maximum_maximum_nights 36322 non-null int64 47 minimum_nights_avg_ntm 36322 non-null float64 48 maximum_nights_avg_ntm 36322 non-null float64 49 calendar_updated 0 non-null float64 50 has_availability 30705 non-null object 51 availability_30 36322 non-null int64 52 availability_60 36322 non-null int64 53 availability_90 36322 non-null int64 54 availability_365 36322 non-null int64 55 calendar_last_scraped 36322 non-null object 56 number_of_reviews 36322 non-null int64 57 number_of_reviews_ltm 36322 non-null int64 58 number_of_reviews_l30d 36322 non-null int64 59 availability_eoy 36322 non-null int64 60 number_of_reviews_ly 36322 non-null int64 61 estimated_occupancy_l365d 36322 non-null int64 62 estimated_revenue_l365d 21459 non-null float64 63 first_review 25171 non-null object 64 last_review 25171 non-null object 65 review_scores_rating 25171 non-null float64 66 review_scores_accuracy 25161 non-null float64 67 review_scores_cleanliness 25170 non-null float64 68 review_scores_checkin 25157 non-null float64 69 review_scores_communication 25165 non-null float64 70 review_scores_location 25154 non-null float64 71 review_scores_value 25155 non-null float64 72 license 5329 non-null object 73 instant_bookable 36322 non-null object 74 calculated_host_listings_count 36322 non-null int64 75 calculated_host_listings_count_entire_homes 36322 non-null int64 76 calculated_host_listings_count_private_rooms 36322 non-null int64 77 calculated_host_listings_count_shared_rooms 36322 non-null int64 78 reviews_per_month 25171 non-null float64 dtypes: float64(19), int64(24), object(36) memory usage: 21.9+ MB
# Data types
data_NY.dtypes
id int64
listing_url object
scrape_id int64
last_scraped object
source object
...
calculated_host_listings_count int64
calculated_host_listings_count_entire_homes int64
calculated_host_listings_count_private_rooms int64
calculated_host_listings_count_shared_rooms int64
reviews_per_month float64
Length: 79, dtype: object
# Show all columns
data_NY.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
'description', 'neighborhood_overview', 'picture_url', 'host_id',
'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
'host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
'host_neighbourhood', 'host_listings_count',
'host_total_listings_count', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'calendar_last_scraped', 'number_of_reviews',
'number_of_reviews_ltm', 'number_of_reviews_l30d', 'availability_eoy',
'number_of_reviews_ly', 'estimated_occupancy_l365d',
'estimated_revenue_l365d', 'first_review', 'last_review',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'license', 'instant_bookable',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
dtype='object')
# New York: split into numeric and non-numeric columns
# Numeric columns
data_numeric = data_NY.select_dtypes(include=[np.number])
numeric_cols = data_numeric.columns.values
print("\n", "1 - Numeric columns: ", "\n")
print(numeric_cols)
# Non-numeric columns
data_non_numeric = data_NY.select_dtypes(exclude=[np.number])
non_numeric_cols = data_non_numeric.columns.values
print("\n", "2 - Non-numeric columns: ", "\n")
print(non_numeric_cols)
1 - Numeric columns: ['id' 'scrape_id' 'host_id' 'host_listings_count' 'host_total_listings_count' 'latitude' 'longitude' 'accommodates' 'bathrooms' 'bedrooms' 'beds' 'minimum_nights' 'maximum_nights' 'minimum_minimum_nights' 'maximum_minimum_nights' 'minimum_maximum_nights' 'maximum_maximum_nights' 'minimum_nights_avg_ntm' 'maximum_nights_avg_ntm' 'calendar_updated' 'availability_30' 'availability_60' 'availability_90' 'availability_365' 'number_of_reviews' 'number_of_reviews_ltm' 'number_of_reviews_l30d' 'availability_eoy' 'number_of_reviews_ly' 'estimated_occupancy_l365d' 'estimated_revenue_l365d' 'review_scores_rating' 'review_scores_accuracy' 'review_scores_cleanliness' 'review_scores_checkin' 'review_scores_communication' 'review_scores_location' 'review_scores_value' 'calculated_host_listings_count' 'calculated_host_listings_count_entire_homes' 'calculated_host_listings_count_private_rooms' 'calculated_host_listings_count_shared_rooms' 'reviews_per_month'] 2 - Non-numeric columns: ['listing_url' 'last_scraped' 'source' 'name' 'description' 'neighborhood_overview' 'picture_url' 'host_url' 'host_name' 'host_since' 'host_location' 'host_about' 'host_response_time' 'host_response_rate' 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url' 'host_picture_url' 'host_neighbourhood' 'host_verifications' 'host_has_profile_pic' 'host_identity_verified' 'neighbourhood' 'neighbourhood_cleansed' 'neighbourhood_group_cleansed' 'property_type' 'room_type' 'bathrooms_text' 'amenities' 'price' 'has_availability' 'calendar_last_scraped' 'first_review' 'last_review' 'license' 'instant_bookable']
# Number of numeric columns
data_NY.select_dtypes(include="number").shape[1]
43
# Number of categorical columns
data_NY.select_dtypes(include=["object"]).shape[1]
36
# New York: take a closer look at the numeric values
data_NY.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 36322.0 | 4.382216e+17 | 5.199100e+17 | 2.539000e+03 | 2.121091e+07 | 5.004978e+07 | 9.277467e+17 | 1.444534e+18 |
| scrape_id | 36322.0 | 2.025062e+13 | 0.000000e+00 | 2.025062e+13 | 2.025062e+13 | 2.025062e+13 | 2.025062e+13 | 2.025062e+13 |
| host_id | 36322.0 | 1.713590e+08 | 1.893709e+08 | 1.678000e+03 | 1.764159e+07 | 8.336440e+07 | 3.036223e+08 | 7.005911e+08 |
| host_listings_count | 36307.0 | 2.441627e+02 | 9.666283e+02 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.000000e+01 | 4.925000e+03 |
| host_total_listings_count | 36307.0 | 3.406701e+02 | 1.240830e+03 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.400000e+01 | 9.109000e+03 |
| latitude | 36322.0 | 4.072856e+01 | 5.648223e-02 | 4.050037e+01 | 4.068828e+01 | 4.072594e+01 | 4.076235e+01 | 4.091139e+01 |
| longitude | 36322.0 | -7.394697e+01 | 5.525153e-02 | -7.425191e+01 | -7.398326e+01 | -7.395471e+01 | -7.392785e+01 | -7.371182e+01 |
| accommodates | 36322.0 | 2.751886e+00 | 1.891702e+00 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 |
| bathrooms | 21682.0 | 1.188382e+00 | 5.544393e-01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.550000e+01 |
| bedrooms | 30274.0 | 1.383332e+00 | 9.381921e-01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.500000e+01 |
| beds | 21620.0 | 1.636494e+00 | 1.193949e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+01 |
| minimum_nights | 36322.0 | 2.920572e+01 | 3.511100e+01 | 1.000000e+00 | 3.000000e+01 | 3.000000e+01 | 3.000000e+01 | 1.124000e+03 |
| maximum_nights | 36322.0 | 6.024392e+04 | 1.126843e+07 | 1.000000e+00 | 1.200000e+02 | 3.650000e+02 | 1.125000e+03 | 2.147484e+09 |
| minimum_minimum_nights | 36322.0 | 2.904061e+01 | 3.496167e+01 | 1.000000e+00 | 3.000000e+01 | 3.000000e+01 | 3.000000e+01 | 1.124000e+03 |
| maximum_minimum_nights | 36322.0 | 3.843604e+01 | 6.464567e+01 | 1.000000e+00 | 3.000000e+01 | 3.000000e+01 | 3.000000e+01 | 1.124000e+03 |
| minimum_maximum_nights | 36322.0 | 2.968219e+05 | 2.519472e+07 | 1.000000e+00 | 3.600000e+02 | 3.650000e+02 | 1.125000e+03 | 2.147484e+09 |
| maximum_maximum_nights | 36322.0 | 8.880841e+05 | 4.363228e+07 | 1.000000e+00 | 3.650000e+02 | 7.300000e+02 | 1.125000e+03 | 2.147484e+09 |
| minimum_nights_avg_ntm | 36322.0 | 3.000185e+01 | 3.757597e+01 | 1.000000e+00 | 3.000000e+01 | 3.000000e+01 | 3.000000e+01 | 1.124000e+03 |
| maximum_nights_avg_ntm | 36322.0 | 5.537397e+05 | 3.055266e+07 | 1.000000e+00 | 3.650000e+02 | 5.791000e+02 | 1.125000e+03 | 2.147484e+09 |
| calendar_updated | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| availability_30 | 36322.0 | 8.725125e+00 | 1.180300e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.800000e+01 | 3.000000e+01 |
| availability_60 | 36322.0 | 2.072482e+01 | 2.406532e+01 | 0.000000e+00 | 0.000000e+00 | 7.000000e+00 | 4.600000e+01 | 6.000000e+01 |
| availability_90 | 36322.0 | 3.509735e+01 | 3.606974e+01 | 0.000000e+00 | 0.000000e+00 | 2.600000e+01 | 7.500000e+01 | 9.000000e+01 |
| availability_365 | 36322.0 | 1.590105e+02 | 1.455351e+02 | 0.000000e+00 | 0.000000e+00 | 1.530000e+02 | 3.100000e+02 | 3.650000e+02 |
| number_of_reviews | 36322.0 | 2.685304e+01 | 6.738206e+01 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 2.300000e+01 | 3.277000e+03 |
| number_of_reviews_ltm | 36322.0 | 4.017262e+00 | 2.008151e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.786000e+03 |
| number_of_reviews_l30d | 36322.0 | 3.030395e-01 | 1.483196e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.530000e+02 |
| availability_eoy | 36322.0 | 8.645298e+01 | 8.021589e+01 | 0.000000e+00 | 0.000000e+00 | 8.500000e+01 | 1.690000e+02 | 1.980000e+02 |
| number_of_reviews_ly | 36322.0 | 3.664886e+00 | 1.911530e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.797000e+03 |
| estimated_occupancy_l365d | 36322.0 | 4.784547e+01 | 8.546655e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.000000e+01 | 2.550000e+02 |
| estimated_revenue_l365d | 21459.0 | 1.523824e+04 | 9.763701e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.851000e+04 | 1.275000e+07 |
| review_scores_rating | 25171.0 | 4.725706e+00 | 4.508312e-01 | 0.000000e+00 | 4.650000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_accuracy | 25161.0 | 4.765266e+00 | 4.377228e-01 | 0.000000e+00 | 4.710000e+00 | 4.900000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_cleanliness | 25170.0 | 4.658376e+00 | 4.999087e-01 | 0.000000e+00 | 4.540000e+00 | 4.810000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_checkin | 25157.0 | 4.835495e+00 | 3.758870e-01 | 0.000000e+00 | 4.820000e+00 | 4.950000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_communication | 25165.0 | 4.824200e+00 | 4.155085e-01 | 0.000000e+00 | 4.820000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_location | 25154.0 | 4.744684e+00 | 3.895678e-01 | 0.000000e+00 | 4.650000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_value | 25155.0 | 4.640968e+00 | 4.918495e-01 | 0.000000e+00 | 4.530000e+00 | 4.760000e+00 | 4.940000e+00 | 5.000000e+00 |
| calculated_host_listings_count | 36322.0 | 6.070415e+01 | 1.950884e+02 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 8.000000e+00 | 1.054000e+03 |
| calculated_host_listings_count_entire_homes | 36322.0 | 4.129800e+01 | 1.801254e+02 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.054000e+03 |
| calculated_host_listings_count_private_rooms | 36322.0 | 1.854292e+01 | 8.345715e+01 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.920000e+02 |
| calculated_host_listings_count_shared_rooms | 36322.0 | 3.317549e-02 | 5.371839e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.700000e+01 |
| reviews_per_month | 25171.0 | 8.172615e-01 | 1.850794e+00 | 1.000000e-02 | 8.000000e-02 | 2.600000e-01 | 9.400000e-01 | 1.219700e+02 |
# New York: describe of column "bathrooms"
data_NY["bathrooms"].describe()
count 21682.000000 mean 1.188382 std 0.554439 min 0.000000 25% 1.000000 50% 1.000000 75% 1.000000 max 15.500000 Name: bathrooms, dtype: float64
# New York: describe of column "availability_365"
data_NY["availability_365"].describe()
count 36322.000000 mean 159.010545 std 145.535113 min 0.000000 25% 0.000000 50% 153.000000 75% 310.000000 max 365.000000 Name: availability_365, dtype: float64
# New York: describe of column "host_since"
data_NY["host_since"].describe()
count 36307 unique 5159 top 2016-12-16 freq 1060 Name: host_since, dtype: object
# Edit date types: object -> datetime
if "host_since" in data_NY.columns:
data_NY["host_since"] = pd.to_datetime(data_NY["host_since"], errors="coerce")
data_NY["host_since"].describe()
count 36307 mean 2017-03-13 15:15:35.522075904 min 2008-08-11 00:00:00 25% 2014-07-07 00:00:00 50% 2016-07-14 00:00:00 75% 2019-10-21 00:00:00 max 2025-06-10 00:00:00 Name: host_since, dtype: object
### Second look - examples: availability, amenities, host<a class="anchor" id="4.2"></a>
# A look into the data
# Read the New York amenities ("amenities") at a given loc(ation)
data_NY["amenities"].loc[3010]
'["Pets allowed", "Essentials", "Kitchen", "First aid kit", "Smoke alarm", "TV with standard cable", "Elevator", "Wifi", "Carbon monoxide alarm", "Air conditioning", "Heating"]'
# A look into the data
# Full row of the listing whose amenities are shown above
data_NY.loc[3010]
id 6179500
listing_url https://www.airbnb.com/rooms/6179500
scrape_id 20250617032754
last_scraped 2025-06-18
source previous scrape
...
calculated_host_listings_count 1
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 1
calculated_host_listings_count_shared_rooms 0
reviews_per_month 0.01
Name: 3010, Length: 79, dtype: object
# A look into the data
# "host_since" - originally of type object
# date types already converted above
data_NY["host_since"].head(1)
0 2008-09-07 Name: host_since, dtype: datetime64[ns]
# A look into the data
if "host_since" in data_NY.columns:
data_NY["host_since"] = pd.to_datetime(data_NY["host_since"], errors="coerce")
data_NY["host_since"].head(10)
0 2008-09-07 1 2008-09-09 2 2009-02-03 3 2009-05-06 4 2009-05-07 5 2009-05-12 6 2009-05-15 7 2009-05-17 8 2010-08-17 9 2010-05-20 Name: host_since, dtype: datetime64[ns]
# A look into the data
# Hosts since "2016-05-20"
data_NY.loc[data_NY["host_since"]=="2016-05-20"][["name", "host_since"]]
| name | host_since | |
|---|---|---|
| 6229 | On Fifth, Across From Central Park | 2016-05-20 |
| 6256 | 2br 1 bath FURN kit MIN 1 month avail Feb1-28 | 2016-05-20 |
| 6257 | STUNNING ONE BEDROOM IN THE HEART OF NEW YORK ... | 2016-05-20 |
| 8822 | Spacious Brooklyn Bedroom Close to Manhattan e... | 2016-05-20 |
| 11187 | Midtown West | 2016-05-20 |
| 12427 | 2 Bed / 2 Bath Apartment in Beautiful Bed-Stuy | 2016-05-20 |
| 14707 | 2 BR, 2BA in prime brownstone Brooklyn location | 2016-05-20 |
| 15318 | Boho Chic in The Bronx / The Green Room | 2016-05-20 |
| 16719 | Boho Chic in The Bronx / The Gold Room | 2016-05-20 |
| 18912 | Boho Chic in The Bronx/The Blue Room | 2016-05-20 |
| 23514 | Luxe Brooklyn Suite | 2016-05-20 |
# A look into the data
# Top 5 of the highest availability (availability_365)
data_5first_availability_365 = data_NY.sort_values(by=["availability_365"], ascending=False).head(5)
data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood_group_cleansed"]]
| name | availability_365 | host_since | neighbourhood_group_cleansed | |
|---|---|---|---|---|
| 22532 | NYC Astoria-LaGuardia #4 TinyPrivateRoom Singl... | 365 | 2019-10-17 | Queens |
| 25501 | Luxury & design Townhouse - NY | 365 | 2019-10-17 | Brooklyn |
| 15253 | Renovated Apartment with high ceilings! | 365 | 2020-01-23 | Brooklyn |
| 25445 | Large Room in Queens, 15 mins to Manhattan #485 | 365 | 2012-08-11 | Queens |
| 25446 | Queens New HDTV Room, 15 min to Manhattan #486 | 365 | 2012-08-11 | Queens |
# A look into the data
# Top 5 of the lowest availability (availability_365)
data_5last_availability_365 = data_NY.sort_values(by=["availability_365"], ascending=False).tail(5)
data_5last_availability_365[["name", "availability_365", "host_since", "neighbourhood_group_cleansed"]]
| name | availability_365 | host_since | neighbourhood_group_cleansed | |
|---|---|---|---|---|
| 11578 | BEAUTIFUL Water View Aparment in Financial | 0 | 2018-10-31 | Manhattan |
| 11577 | Parkway Abode | 0 | 2013-12-27 | Brooklyn |
| 11575 | QUIET Zen Garden Pad in BEST Brooklyn location! | 0 | 2010-12-21 | Brooklyn |
| 11573 | Cozy NYC Downtown Apartment | 0 | 2014-03-21 | Manhattan |
| 8769 | Penthouse Apt with Incredible Views & Location | 0 | 2014-08-30 | Manhattan |
# A look at the data
# Top 5 "availability_365" with style.bar
data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood_group_cleansed"]].style.bar(subset=["availability_365"])
| name | availability_365 | host_since | neighbourhood_group_cleansed | |
|---|---|---|---|---|
| 22532 | NYC Astoria-LaGuardia #4 TinyPrivateRoom SingleBed | 365 | 2019-10-17 00:00:00 | Queens |
| 25501 | Luxury & design Townhouse - NY | 365 | 2019-10-17 00:00:00 | Brooklyn |
| 15253 | Renovated Apartment with high ceilings! | 365 | 2020-01-23 00:00:00 | Brooklyn |
| 25445 | Large Room in Queens, 15 mins to Manhattan #485 | 365 | 2012-08-11 00:00:00 | Queens |
| 25446 | Queens New HDTV Room, 15 min to Manhattan #486 | 365 | 2012-08-11 00:00:00 | Queens |
# A look at the data
# Check the sum of number_of_reviews
print("number_of_reviews: ", data_NY["number_of_reviews"].sum())
# Check the number of unique property_type values
print("property_type: ", len(data_NY["property_type"].unique()))
# Check the number of unique neighbourhood_group_cleansed values
print("neighbourhood_group_cleansed: ", len(data_NY["neighbourhood_group_cleansed"].unique()))
number_of_reviews: 975356 property_type: 74 neighbourhood_group_cleansed: 5
# Investigate Room Type
data_NY["room_type"].unique()
array(['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'],
dtype=object)
Descriptive Statistics¶
In this part we address the following core topics:
- Drop columns that contain only URLs, images, host name, host ID, host location, or the description
- Convert the date columns and the boolean columns
- Handle null values
# Drop columns that contain only URLs, images, host name, host ID, host location, or the description
# The following columns hold values calculated at the last scrape:
# calculated_host_listings_count
# calculated_host_listings_count_entire_homes
# calculated_host_listings_count_private_rooms
# calculated_host_listings_count_shared_rooms
# Column "host_name": name of the host, usually just the first name(s)
# Columns "host_listings_count", "host_total_listings_count": the number of listings the host has (per Airbnb, calculation unknown)
# Columns not needed for the analysis:
drop_columns = ["listing_url",
"scrape_id",
"last_scraped",
"source",
"description",
"neighborhood_overview",
"picture_url",
"host_url",
"host_name",
"host_about",
"host_thumbnail_url",
"host_picture_url",
"host_listings_count",
"host_total_listings_count",
"host_has_profile_pic",
"calendar_last_scraped",
"calculated_host_listings_count",
"calculated_host_listings_count_entire_homes",
"calculated_host_listings_count_private_rooms",
"calculated_host_listings_count_shared_rooms"
]
df_NY = data_NY.drop(columns=drop_columns, axis=1, errors="ignore")
print(df_NY[["host_identity_verified","instant_bookable","has_availability"]])
# Normalize boolean-like columns ("t"/"f" strings or real booleans) to the pandas "boolean" dtype
def to_bool(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip().lower()
    if s in ("t", "true"):
        return True
    if s in ("f", "false"):
        return False
    return pd.NA

bool_cols = ["host_identity_verified", "instant_bookable", "has_availability"]
for c in bool_cols:
    if c in df_NY.columns:
        df_NY[c] = df_NY[c].map(to_bool).astype("boolean")
host_identity_verified instant_bookable has_availability
0 t f t
1 t f t
2 t t t
3 t f t
4 f f t
... ... ... ...
36317 t t t
36318 t t t
36319 t f t
36320 t f t
36321 f f t
[36322 rows x 3 columns]
print(df_NY[["host_identity_verified","instant_bookable","has_availability"]].head())
host_identity_verified instant_bookable has_availability
0 True False True
1 True False True
2 True True True
3 True False True
4 False False True
# Column "license" - values [nan, 'Exempt', "OSE-*"]
print(df_NY["license"].unique()[:5])
[i for i in df_NY["license"].unique() if str(i)[:4].upper() != "OSE-"]
[nan 'OSE-STRREG-0000008' 'OSE-STRREG-0000923' 'OSE-STRREG-0000656' 'OSE-STRREG-0000108']
[nan, 'Exempt']
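The license values fall into three rough groups (missing, 'Exempt', and 'OSE-…' registrations), which later feed the New York vs. Rome donut charts. A minimal bucketing helper can be sketched as follows; the function name and bucket labels are illustrative, not taken from the notebook:

```python
import pandas as pd

def license_status(value):
    """Bucket a raw license value into a coarse status label (illustrative)."""
    if pd.isna(value):
        return "Missing"
    s = str(value).strip()
    if s.lower() == "exempt":
        return "Exempt"
    if s.upper().startswith("OSE-"):
        return "Registered (OSE)"
    return "Other"

licenses = pd.Series([None, "Exempt", "OSE-STRREG-0000008"])
print(licenses.map(license_status).tolist())  # ['Missing', 'Exempt', 'Registered (OSE)']
```

Such a derived status column keeps the raw registration numbers intact while making the categories countable for a pie/donut chart.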
# Check for missing values
df_NY.isna().any()
id False name True host_id False host_since True host_location True host_response_time True host_response_rate True host_acceptance_rate True host_is_superhost True host_neighbourhood True host_verifications True host_identity_verified True neighbourhood True neighbourhood_cleansed False neighbourhood_group_cleansed False latitude False longitude False property_type False room_type False accommodates False bathrooms True bathrooms_text True bedrooms True beds True amenities False price True minimum_nights False maximum_nights False minimum_minimum_nights False maximum_minimum_nights False minimum_maximum_nights False maximum_maximum_nights False minimum_nights_avg_ntm False maximum_nights_avg_ntm False calendar_updated True has_availability True availability_30 False availability_60 False availability_90 False availability_365 False number_of_reviews False number_of_reviews_ltm False number_of_reviews_l30d False availability_eoy False number_of_reviews_ly False estimated_occupancy_l365d False estimated_revenue_l365d True first_review True last_review True review_scores_rating True review_scores_accuracy True review_scores_cleanliness True review_scores_checkin True review_scores_communication True review_scores_location True review_scores_value True license True instant_bookable False reviews_per_month True dtype: bool
# Helper function: missing values per column and their percentage
missing = df_NY.isnull().sum().to_frame()

def missing_values(df):
    # Note: updates the module-level `missing` frame, which later cells reuse as a mask
    missing["Missing_values"] = df.isnull().sum()
    missing["Total_perc"] = (missing["Missing_values"] / len(df)) * 100
    return missing

missing_values(df_NY)[missing["Total_perc"] > 0].Total_perc
name 0.005506
host_since 0.041297
host_location 20.398106
host_response_time 40.669567
host_response_rate 40.669567
host_acceptance_rate 40.201531
host_is_superhost 1.354551
host_neighbourhood 20.064974
host_verifications 0.041297
host_identity_verified 0.041297
neighbourhood 47.456087
bathrooms 40.306151
bathrooms_text 0.099113
bedrooms 16.651065
beds 40.476846
price 40.920104
calendar_updated 100.000000
has_availability 15.464457
estimated_revenue_l365d 40.920104
first_review 30.700402
last_review 30.700402
review_scores_rating 30.700402
review_scores_accuracy 30.727933
review_scores_cleanliness 30.703155
review_scores_checkin 30.738946
review_scores_communication 30.716921
review_scores_location 30.747206
review_scores_value 30.744452
license 85.328451
reviews_per_month 30.700402
Name: Total_perc, dtype: float64
# Column "calendar_updated": missing rate 100%
df_NY["calendar_updated"].unique()
# Column "neighbourhood" has only two values: ['Neighborhood highlights', nan]
df_NY["neighbourhood"].unique()
# Missing rate > 40% and no description in the data dictionary:
# host_is_superhost
# host_response_time
# host_response_rate
array(['Neighborhood highlights', nan], dtype=object)
drop_columns = ["host_response_time",
"host_response_rate",
"host_is_superhost",
"neighbourhood",
"calendar_updated",
]
df_NY = df_NY.drop(columns=drop_columns, axis=1, errors="ignore")
# print(df_NY.info())
# Median
df_NY.median(numeric_only=True)
id 50049775.5 host_id 83364401.0 host_identity_verified 1.0 latitude 40.725941 longitude -73.954711 accommodates 2.0 bathrooms 1.0 bedrooms 1.0 beds 1.0 minimum_nights 30.0 maximum_nights 365.0 minimum_minimum_nights 30.0 maximum_minimum_nights 30.0 minimum_maximum_nights 365.0 maximum_maximum_nights 730.0 minimum_nights_avg_ntm 30.0 maximum_nights_avg_ntm 579.1 has_availability 1.0 availability_30 0.0 availability_60 7.0 availability_90 26.0 availability_365 153.0 number_of_reviews 3.0 number_of_reviews_ltm 0.0 number_of_reviews_l30d 0.0 availability_eoy 85.0 number_of_reviews_ly 0.0 estimated_occupancy_l365d 0.0 estimated_revenue_l365d 0.0 review_scores_rating 4.86 review_scores_accuracy 4.9 review_scores_cleanliness 4.81 review_scores_checkin 4.95 review_scores_communication 4.96 review_scores_location 4.85 review_scores_value 4.76 instant_bookable 0.0 reviews_per_month 0.26 dtype: Float64
# Standard Deviation
df_NY.std(numeric_only=True)
id 519910010100084032.0 host_id 189370891.502614 host_identity_verified 0.332515 latitude 0.056482 longitude 0.055252 accommodates 1.891702 bathrooms 0.554439 bedrooms 0.938192 beds 1.193949 minimum_nights 35.111004 maximum_nights 11268434.763247 minimum_minimum_nights 34.961666 maximum_minimum_nights 64.645669 minimum_maximum_nights 25194723.611818 maximum_maximum_nights 43632280.118655 minimum_nights_avg_ntm 37.575971 maximum_nights_avg_ntm 30552660.280436 has_availability 0.0 availability_30 11.803004 availability_60 24.065318 availability_90 36.069741 availability_365 145.535113 number_of_reviews 67.382057 number_of_reviews_ltm 20.081506 number_of_reviews_l30d 1.483196 availability_eoy 80.215889 number_of_reviews_ly 19.115302 estimated_occupancy_l365d 85.46655 estimated_revenue_l365d 97637.007729 review_scores_rating 0.450831 review_scores_accuracy 0.437723 review_scores_cleanliness 0.499909 review_scores_checkin 0.375887 review_scores_communication 0.415509 review_scores_location 0.389568 review_scores_value 0.49185 instant_bookable 0.398836 reviews_per_month 1.850794 dtype: Float64
# Mean of "accommodates"
df_NY["accommodates"].mean().round(2)
np.float64(2.75)
# Mean of "minimum_nights" and "maximum_nights"
df_NY["minimum_nights"].mean().round(2), df_NY["maximum_nights"].mean().round(2)
(np.float64(29.21), np.float64(60243.92))
# IQR (interquartile range)
num = df_NY.select_dtypes(include=[np.floating, np.integer])
# num.quantile(0.50, numeric_only=True)
num.quantile(0.75) - num.quantile(0.25)
id 9.277467e+17 host_id 2.859807e+08 latitude 7.406750e-02 longitude 5.540750e-02 accommodates 2.000000e+00 bathrooms 0.000000e+00 bedrooms 1.000000e+00 beds 1.000000e+00 minimum_nights 0.000000e+00 maximum_nights 1.005000e+03 minimum_minimum_nights 0.000000e+00 maximum_minimum_nights 0.000000e+00 minimum_maximum_nights 7.650000e+02 maximum_maximum_nights 7.600000e+02 minimum_nights_avg_ntm 0.000000e+00 maximum_nights_avg_ntm 7.600000e+02 availability_30 1.800000e+01 availability_60 4.600000e+01 availability_90 7.500000e+01 availability_365 3.100000e+02 number_of_reviews 2.300000e+01 number_of_reviews_ltm 1.000000e+00 number_of_reviews_l30d 0.000000e+00 availability_eoy 1.690000e+02 number_of_reviews_ly 1.000000e+00 estimated_occupancy_l365d 6.000000e+01 estimated_revenue_l365d 1.851000e+04 review_scores_rating 3.500000e-01 review_scores_accuracy 2.900000e-01 review_scores_cleanliness 4.600000e-01 review_scores_checkin 1.800000e-01 review_scores_communication 1.800000e-01 review_scores_location 3.500000e-01 review_scores_value 4.100000e-01 reviews_per_month 8.600000e-01 dtype: float64
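The interquartile ranges above can directly drive a simple outlier screen via Tukey's rule, with fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR; a minimal sketch on synthetic data:

```python
import pandas as pd

def tukey_fences(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 100])  # 100 is the obvious outlier
low, high = tukey_fences(s)
print(s[(s < low) | (s > high)].tolist())  # [100]
```

For heavily skewed columns such as maximum_nights, quantile-based capping (as used below for price) is often more robust than a fixed multiplier.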
df_NY.describe(include=['object', 'category'])
| name | host_location | host_acceptance_rate | host_neighbourhood | host_verifications | neighbourhood_cleansed | neighbourhood_group_cleansed | property_type | room_type | bathrooms_text | amenities | price | first_review | last_review | license | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36320 | 28913 | 21720 | 29034 | 36307 | 36322 | 36322 | 36322 | 36322 | 36286 | 36322 | 21459 | 25171 | 25171 | 5329 |
| unique | 34732 | 985 | 100 | 535 | 7 | 223 | 5 | 74 | 4 | 33 | 29619 | 1101 | 4416 | 3322 | 1963 |
| top | Water View King Bed Hotel Room | New York, NY | 100% | Bedford-Stuyvesant | ['email', 'phone'] | Bedford-Stuyvesant | Manhattan | Entire rental unit | Entire home/apt | 1 bath | ["Washer", "Kitchen", "Smoke alarm", "TV", "Wi... | $150.00 | 2023-01-01 | 2025-05-31 | Exempt |
| freq | 30 | 21784 | 5681 | 1900 | 28107 | 2615 | 16081 | 15335 | 19435 | 18449 | 225 | 318 | 61 | 410 | 3052 |
Missing Data - Null Values¶
Using isnull().sum() we see that almost all columns contain null values:
df_NY.isnull().sum()
id 0 name 2 host_id 0 host_since 15 host_location 7409 host_acceptance_rate 14602 host_neighbourhood 7288 host_verifications 15 host_identity_verified 15 neighbourhood_cleansed 0 neighbourhood_group_cleansed 0 latitude 0 longitude 0 property_type 0 room_type 0 accommodates 0 bathrooms 14640 bathrooms_text 36 bedrooms 6048 beds 14702 amenities 0 price 14863 minimum_nights 0 maximum_nights 0 minimum_minimum_nights 0 maximum_minimum_nights 0 minimum_maximum_nights 0 maximum_maximum_nights 0 minimum_nights_avg_ntm 0 maximum_nights_avg_ntm 0 has_availability 5617 availability_30 0 availability_60 0 availability_90 0 availability_365 0 number_of_reviews 0 number_of_reviews_ltm 0 number_of_reviews_l30d 0 availability_eoy 0 number_of_reviews_ly 0 estimated_occupancy_l365d 0 estimated_revenue_l365d 14863 first_review 11151 last_review 11151 review_scores_rating 11151 review_scores_accuracy 11161 review_scores_cleanliness 11152 review_scores_checkin 11165 review_scores_communication 11157 review_scores_location 11168 review_scores_value 11167 license 30993 instant_bookable 0 reviews_per_month 11151 dtype: int64
# Number of distinct rows: 36322
# No duplicates
df_NY.drop_duplicates().shape[0]
36322
Clean the columns price, bathrooms, has_availability¶
Clean the "price" column and fill missing values¶
Missing values: 40.9%
Goal: Convert the price column cleanly to float, remove outliers, and impute missing values with a hierarchical, median-based approach.
Approach:
- Clean the price: convert to float; strip currency symbols and thousands separators.
- Implausible values: price <= 0 or very high (e.g. above the 99.5% quantile) -> NaN.
- Check before imputation: inspect distributions and key figures before filling.
- Imputation: hierarchical median.
- Check after imputation: inspect distributions and key figures after filling.
print("Missing price before imputation:", df_NY["price"].isna().sum())
Missing price before imputation: 14863
# Investigate "price" column
# df_NY[df_NY["price"].isna() == True].loc[24]
df_NY["price"].loc[20:26]
# Collect the first character of each distinct price value to detect the currency symbol(s)
def find_currency(df):
    currency = {str(i)[0] for i in df["price"].unique()}
    currency.discard("n")  # drop the leading 'n' of nan entries
    return currency

find_currency(df_NY)
{'$'}
# Clean the price: object -> float
df_NY["price"] = (df_NY["price"].astype(str)
                  .str.replace(r"[^\d.\-]", "", regex=True)
                  .replace({"": np.nan}))
df_NY["price"] = pd.to_numeric(df_NY["price"], errors="coerce")
# Obvious errors -> NaN
df_NY.loc[df_NY["price"] <= 0, "price"] = np.nan
upper_cap = df_NY["price"].quantile(0.995)  # very extreme peaks
# Treat extreme values as NaN
df_NY.loc[df_NY["price"] > upper_cap, "price"] = np.nan
# Hierarchical median imputation (fine-grained)
groupings = [
["neighbourhood_group_cleansed","room_type","accommodates"],
["neighbourhood_group_cleansed","room_type"],
["room_type","accommodates"],
["room_type"],
]
for cols in groupings:
cols = [c for c in cols if c in df_NY.columns]
if not cols:
continue
med = df_NY.groupby(cols)["price"].transform("median")
need = df_NY["price"].isna() & med.notna()
df_NY.loc[need, "price"] = med[need]
# Fallback: global median (coarse)
# if df_NY["price"].isna().any():
# df_NY["price"] = df_NY["price"].fillna(df_NY["price"].median())
# Check
print("Missing price after imputation:", df_NY["price"].isna().sum())
Missing price after imputation: 0
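The hierarchical fill relies on groupby(...).transform("median"), which broadcasts each group's median back to the original row index; the mechanism on a toy frame (synthetic data, column names mirror the real ones):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    "price": [50.0, np.nan, 200.0, np.nan],
})
# Per-group median, aligned to the original row index
med = toy.groupby("room_type")["price"].transform("median")
# Fill only rows where price is missing and a group median exists
need = toy["price"].isna() & med.notna()
toy.loc[need, "price"] = med[need]
print(toy["price"].tolist())  # [50.0, 50.0, 200.0, 200.0]
```

Because each coarser grouping only fills what the finer ones left as NaN, the loop over `groupings` above never overwrites an already-imputed value.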
Clean the "bathrooms" column and fill missing values from "bathrooms_text"¶
Missing values: 40.3%
Goal: Complete the numeric bathrooms column (float) by deriving missing values from the textual bathrooms_text column. Both fields describe the same information.
Approach:
- Parsing rules:
  - A number directly before bath/bathrooms counts (e.g. 1 bath, 1.5 baths, 2 bathrooms, 0 baths).
  - half-bath without a number counts as 0.5.
  - Qualifiers such as private/shared are ignored.
- Filling: only fill where bathrooms is NaN; existing numeric values are not overwritten.
- Check: how many rows were filled, how many are still missing.
df_NY[df_NY["bathrooms"].notna()][["bathrooms", "bathrooms_text"]].head(10)
| bathrooms | bathrooms_text | |
|---|---|---|
| 0 | 1.0 | 1 private bath |
| 1 | 1.0 | 1 bath |
| 2 | 1.5 | 1.5 baths |
| 3 | 1.0 | 1 bath |
| 5 | 1.0 | 1 shared bath |
| 7 | 1.0 | 1 private bath |
| 8 | 1.0 | 1 bath |
| 11 | 1.0 | 1 bath |
| 14 | 1.0 | 1 bath |
| 16 | 1.5 | 1.5 shared baths |
# Investigate
df_NY["bathrooms"].unique()
array([ 1. , 1.5, nan, 2. , 2.5, 3. , 5. , 0. , 0.5, 3.5, 15.5,
10.5, 4. , 4.5, 5.5, 6. , 7. , 7.5, 9. , 6.5])
# Investigate
df_NY["bathrooms_text"].unique()
array(['1 private bath', '1 bath', '1.5 baths', '1 shared bath',
'1.5 shared baths', '2 baths', '2.5 baths', nan, '3 baths',
'5 baths', '0 shared baths', '2 shared baths', 'Shared half-bath',
'3.5 baths', '2.5 shared baths', 'Half-bath', 'Private half-bath',
'0 baths', '15.5 baths', '3 shared baths', '10.5 baths', '4 baths',
'4.5 baths', '3.5 shared baths', '4 shared baths',
'4.5 shared baths', '5.5 baths', '6 baths', '7 baths',
'6 shared baths', '7.5 baths', '9 baths', '5 shared baths',
'6.5 baths'], dtype=object)
# Investigate: "bathrooms", "bathrooms_text"
df_bathrooms_null_bathrooms_text = df_NY[df_NY["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]
df_bathrooms_null_bathrooms_text.head(10)
# df_bathrooms_null_bathrooms_text.index.values
idx = df_bathrooms_null_bathrooms_text.index.values
# fill df_NY["bathrooms"] from df_NY["bathrooms_text"] only where it’s missing
# (handles “1 bath”, “1.5 baths”, “Half-bath”, “shared half-bath”, “2 bathrooms”, etc.):
pattern = re.compile(r"""(?ix)
^\s* # allow leading spaces
(?: # start alternation
(?P<num>\d+(?:\.\d+)?)\s* # 1, 1.5, 0, 15.5
(?: (?:private|shared)\s+ )? # optional qualifier after number
bath(?:room)?s? # bath / bathroom(s)
|
(?: (?:private|shared)\s+ )? # optional qualifier before 'half'
half[-\s]?bath(?:room)?s? # half-bath / half bathroom(s)
)
\s*$ # allow trailing spaces
""")
def parse_bath(text: str):
    """Parse the bath count from bathrooms_text (formats listed above)."""
if not isinstance(text, str) or not text.strip():
return np.nan
m = pattern.match(text.strip())
if not m:
return np.nan
if m.group('num') is not None: # numeric forms: '1 bath', '1.5 shared baths', '0 baths'
return float(m.group('num'))
else: # 'half-bath', 'shared half-bath', 'private half-bath'
return 0.5
# fill only missing bathrooms from parsed bathrooms_text
df_NY["bathrooms"] = df_NY["bathrooms"].fillna(df_NY["bathrooms_text"].apply(parse_bath))
# Missing values
df_NY["bathrooms"].isna().sum()
np.int64(25)
df_NY["bathrooms"].unique()
array([ 1. , 1.5, 2. , 2.5, nan, 3. , 5. , 0. , 0.5, 3.5, 15.5,
10.5, 4. , 4.5, 5.5, 6. , 7. , 7.5, 9. , 6.5])
df_bathrooms_null_bathrooms_text_filled = df_NY[df_NY["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]
len(df_bathrooms_null_bathrooms_text_filled)
25
df_bathrooms_null_bathrooms_text_filled.index.values
array([ 59, 389, 443, 457, 469, 828, 849, 2347, 2503,
3018, 5822, 14940, 17366, 17367, 17381, 20029, 20054, 20057,
28437, 30861, 33572, 34513, 34973, 35625, 35811])
df_NY = df_NY.drop("bathrooms_text", axis=1)
Examine the columns "host_neighbourhood" and "host_location"¶
Missing values > 20%
Goal: Assess the columns host_neighbourhood and host_location.
Approach:
- Compare host_neighbourhood with neighbourhood_cleansed.
- Drop host_neighbourhood.
- Compare host_location with latitude and longitude.
- Drop host_location.
# Investigate: no description in the data dictionary
df_NY["host_neighbourhood"].unique()[:10]
# array(['Gravesend', 'Midtown', 'Greenwood Heights', 'Williamsburg',
# 'East Harlem', 'Fort Greene', 'Ridgewood', 'Alphabet City',
# 'Greenwich Village', 'Harlem', ...], dtype=object)
array(['Gravesend', 'Midtown', 'Greenwood Heights', 'Williamsburg',
'East Harlem', 'Fort Greene', 'Ridgewood', 'Alphabet City',
'Greenwich Village', 'Harlem'], dtype=object)
df_NY[df_NY["host_neighbourhood"].notnull()][["host_neighbourhood", "neighbourhood_cleansed"]].head(10)
| host_neighbourhood | neighbourhood_cleansed | |
|---|---|---|
| 0 | Gravesend | Kensington |
| 1 | Midtown | Midtown |
| 2 | Greenwood Heights | Sunset Park |
| 3 | Williamsburg | Williamsburg |
| 4 | East Harlem | East Harlem |
| 5 | East Harlem | East Harlem |
| 6 | Williamsburg | Williamsburg |
| 7 | Fort Greene | Fort Greene |
| 8 | Ridgewood | Ridgewood |
| 9 | Alphabet City | East Village |
len(df_NY[df_NY["host_neighbourhood"] == df_NY["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
16173
len(df_NY[df_NY["host_neighbourhood"] != df_NY["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
20149
# Drop "host_neighbourhood": the column "neighbourhood_cleansed" already provides the correct location
df_NY = df_NY.drop("host_neighbourhood", axis=1)
# Description in the data dictionary: the host's self-reported location
df_NY["host_location"].unique()[:10]
# array(['New York, NY', 'Woodstock, NY', ...])
array(['New York, NY', 'Woodstock, NY', 'Las Vegas, NV',
'New York, United States', 'Great Neck, NY', 'Berkeley, CA',
'Los Angeles, CA', nan, 'Montreal, Canada', 'United States'],
dtype=object)
# Compare the column "host_location" with "latitude" and "longitude"
df_NY[df_NY["host_location"].notnull()][["host_location", "latitude", "longitude"]].head(10)
| host_location | latitude | longitude | |
|---|---|---|---|
| 0 | New York, NY | 40.64529 | -73.97238 |
| 1 | Woodstock, NY | 40.75356 | -73.98559 |
| 2 | New York, NY | 40.66265 | -73.99454 |
| 3 | New York, NY | 40.70935 | -73.95342 |
| 4 | New York, NY | 40.80107 | -73.94255 |
| 5 | New York, NY | 40.78778 | -73.94759 |
| 6 | New York, NY | 40.71248 | -73.95881 |
| 7 | New York, NY | 40.69194 | -73.97389 |
| 8 | New York, NY | 40.70271 | -73.89930 |
| 9 | New York, NY | 40.72530 | -73.98028 |
# Investigate the column "host_location"
# Convert type df_NY["host_location"] to string
df_NY["host_location"] = df_NY["host_location"].astype("string")
df_NY[df_NY["host_location"] == "Calgary, Canada"][["host_location", "latitude", "longitude"]]
# host_location   latitude   longitude
# 34152  Calgary, Canada  40.678622  -73.946671
# -> but the listing coordinates are in New York!
df_NY[df_NY["host_location"] == "Vila Velha, Brazil"][["host_location", "latitude", "longitude"]]
# host_location   latitude   longitude
# 35999  Vila Velha, Brazil  40.730115  -73.982319
# -> but the listing coordinates are in New York!
| host_location | latitude | longitude | |
|---|---|---|---|
| 35999 | Vila Velha, Brazil | 40.730115 | -73.982319 |
# Drop "host_location": self-reported and often inconsistent with the actual coordinates
df_NY = df_NY.drop("host_location", axis=1)
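The mismatches above (e.g. "Calgary, Canada" paired with Brooklyn coordinates) can also be confirmed programmatically with a rough NYC bounding box; a plausibility-check sketch, where the box bounds are assumptions rather than values from the notebook:

```python
# Approximate NYC bounding box; the bounds are assumed values for a plausibility check only
LAT_MIN, LAT_MAX = 40.49, 40.92
LON_MIN, LON_MAX = -74.27, -73.68

def in_nyc(lat: float, lon: float) -> bool:
    """True if the point lies inside the rough NYC bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# The two "foreign" host locations found above both have NYC listing coordinates
print(in_nyc(40.678622, -73.946671), in_nyc(40.730115, -73.982319))  # True True
```

This confirms that latitude/longitude, not host_location, is the reliable spatial signal, which justifies dropping the column.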
Examine the column "has_availability" and fill missing values from "availability_365"¶
Missing values: 15%
Goal: Complete the boolean column has_availability by deriving missing values from availability_365.
Approach:
- Filling: only fill where has_availability is NaN; existing boolean values are not overwritten.
- Check: how many rows were filled, how many are still missing.
print("Missing \"has_availability\" before:", df_NY["has_availability"].isna().sum())
Missing "has_availability" before: 5617
# Investigate "has_availability"
df_NY["has_availability"].unique()
# [True, <NA>]
df_NY[df_NY["has_availability"].isnull()][["has_availability", "availability_30", "availability_60", "availability_90", "availability_365"]]
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
|---|---|---|---|---|---|
| 13 | <NA> | 0 | 0 | 0 | 0 |
| 35 | <NA> | 0 | 0 | 0 | 0 |
| 42 | <NA> | 0 | 0 | 0 | 0 |
| 59 | <NA> | 0 | 0 | 0 | 0 |
| 71 | <NA> | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 36268 | <NA> | 28 | 58 | 88 | 88 |
| 36273 | <NA> | 6 | 21 | 51 | 141 |
| 36276 | <NA> | 30 | 60 | 90 | 365 |
| 36306 | <NA> | 16 | 46 | 66 | 66 |
| 36311 | <NA> | 10 | 31 | 31 | 31 |
5617 rows × 5 columns
# Investigate "has_availability"
mask = (
df_NY["availability_30"].eq(0) &
df_NY["availability_60"].eq(0) &
df_NY["availability_90"].eq(0) &
df_NY["availability_365"].eq(0) &
df_NY["has_availability"].isna()
)
count = mask.sum() # number of rows
rows = df_NY.loc[mask] # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
|---|---|---|---|---|---|
| 13 | <NA> | 0 | 0 | 0 | 0 |
| 35 | <NA> | 0 | 0 | 0 | 0 |
| 42 | <NA> | 0 | 0 | 0 | 0 |
| 59 | <NA> | 0 | 0 | 0 | 0 |
| 71 | <NA> | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 33888 | <NA> | 0 | 0 | 0 | 0 |
| 33982 | <NA> | 0 | 0 | 0 | 0 |
| 34117 | <NA> | 0 | 0 | 0 | 0 |
| 34294 | <NA> | 0 | 0 | 0 | 0 |
| 36088 | <NA> | 0 | 0 | 0 | 0 |
5348 rows × 5 columns
mask = (
    df_NY["has_availability"].isna()
    | (
        df_NY["availability_30"].eq(0)
        & df_NY["availability_60"].eq(0)
        & df_NY["availability_90"].eq(0)
        & df_NY["availability_365"].eq(0)
    )
)
count = mask.sum() # number of rows
rows = df_NY.loc[mask] # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
|---|---|---|---|---|---|
| 6 | True | 0 | 0 | 0 | 0 |
| 9 | True | 0 | 0 | 0 | 0 |
| 10 | True | 0 | 0 | 0 | 0 |
| 13 | <NA> | 0 | 0 | 0 | 0 |
| 15 | True | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 36268 | <NA> | 28 | 58 | 88 | 88 |
| 36273 | <NA> | 6 | 21 | 51 | 141 |
| 36276 | <NA> | 30 | 60 | 90 | 365 |
| 36306 | <NA> | 16 | 46 | 66 | 66 |
| 36311 | <NA> | 10 | 31 | 31 | 31 |
12650 rows × 5 columns
# Clean df_NY["has_availability"]
# Minimal (use 365 days only)
# fill the 15% missing in has_availability: True if there is at least 1 available day in the next 365, else False
df_NY["has_availability"] = (
df_NY["has_availability"]
.fillna(df_NY["availability_365"].gt(0))
.astype("boolean")
)
print("Missing \"has_availability\" after:", df_NY["has_availability"].isna().sum())
Missing "has_availability" after: 0
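The same fill rule can be checked on a tiny frame to confirm that existing flags are preserved and only missing ones are derived from availability_365 (synthetic data):

```python
import pandas as pd

toy = pd.DataFrame({
    "has_availability": pd.array([True, None, None], dtype="boolean"),
    "availability_365": [0, 0, 120],
})
# Same rule as above: missing flag -> True iff at least one open day in the next 365
toy["has_availability"] = (
    toy["has_availability"].fillna(toy["availability_365"].gt(0)).astype("boolean")
)
print(toy["has_availability"].tolist())  # [True, False, True]
```

Note that the first row keeps its original True even though availability_365 is 0, matching the observation above that some listings are flagged available despite a fully booked calendar.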
missing_values(df_NY)[missing["Total_perc"] > 0].Total_perc
name 0.005506
host_since 0.041297
host_acceptance_rate 40.201531
host_verifications 0.041297
host_identity_verified 0.041297
bathrooms 0.068829
bedrooms 16.651065
beds 40.476846
estimated_revenue_l365d 40.920104
first_review 30.700402
last_review 30.700402
review_scores_rating 30.700402
review_scores_accuracy 30.727933
review_scores_cleanliness 30.703155
review_scores_checkin 30.738946
review_scores_communication 30.716921
review_scores_location 30.747206
review_scores_value 30.744452
license 85.328451
reviews_per_month 30.700402
Name: Total_perc, dtype: float64
Examine the columns "estimated_revenue_l365d", "host_acceptance_rate", and "beds"¶
Missing values > 40%
Goal: Assess the columns estimated_revenue_l365d, host_acceptance_rate, and beds.
Approach:
- Drop estimated_revenue_l365d.
- Drop host_acceptance_rate.
- Drop beds.
# Investigate: not in the dictionary
len(df_NY["estimated_revenue_l365d"].unique())
2442
# Drop "estimated_revenue_l365d"
df_NY = df_NY.drop("estimated_revenue_l365d", axis=1)
# Investigate
# "bedrooms" missing values 16.651065
# "beds": missing values 40.476846
# mask = (
# df_NY["beds"].eq(0) &
# df_NY["bedrooms"].eq(0)
# )
# count = mask.sum()
# count
# rows = df_NY.loc[mask]
# rows[["beds", "bedrooms"]]
# 60
# mask = (
# df_NY["beds"].isnull() &
# df_NY["bedrooms"].notnull()
# )
# mask = (
# df_NY["beds"].notnull() &
# df_NY["bedrooms"].notnull() &
# df_NY["beds"] > df_NY["bedrooms"]
# )
# rows = df_NY.loc[mask]
# len(rows)*100/len(df) # 24 %
# Drop "beds": too much missing data
df_NY = df_NY.drop("beds", axis=1)
# Not needed for the statistics: drop "host_acceptance_rate"
df_NY = df_NY.drop("host_acceptance_rate", axis=1)
Clean the columns "first_review" and "last_review" and fill missing values¶
Missing values > 30%
Goal: Convert the date fields first_review and last_review to datetime64[ns], remove outliers (future dates and extreme values), and fill missing values with each column's global median date.
Approach:
- Date format: convert to datetime64[ns].
- Obvious outliers:
  - Future dates (> today) -> NaT.
  - Very late values (e.g. above the 99.5% quantile) -> NaT.
  - Very early values (below the 0.5% quantile) -> NaT.
- Consistency: ensure first_review <= last_review; otherwise set inconsistent values to NaT.
- Check before imputation: inspect distributions and missing share.
- Imputation: fill NaT with the global median per column (first_review, last_review).
- Check after imputation: re-check distributions and key figures; document the share of imputed values.
# Check missing values
date_cols = ["first_review", "last_review"]
for c in date_cols:
print("Missing values before imputation:", df_NY[c].isna().sum())
Missing values before imputation: 11151
Missing values before imputation: 11151
# Date extreme values (simple) -> NaT, then fill with the median
date_cols = ["first_review", "last_review"]
for c in date_cols:
    s = pd.to_datetime(df_NY[c], errors="coerce")
    # Obvious outliers -> NaT
    s = s.mask(s > pd.Timestamp.today())  # future dates
    upper_cap = s.quantile(0.995)         # very late outliers (upper 99.5% quantile)
    s = s.mask(s > upper_cap)
    # Very early outliers
    lower_cap = s.quantile(0.005)
    s = s.mask(s < lower_cap)
    df_NY[c] = s                          # write the cleaned dates back
# Consistency check: first_review <= last_review, otherwise conservatively set both to NaT
both = df_NY["first_review"].notna() & df_NY["last_review"].notna()
bad = both & (df_NY["first_review"] > df_NY["last_review"])
if bad.any():
    df_NY.loc[bad, ["first_review", "last_review"]] = pd.NaT
# Fill missing values with the global median date
for c in date_cols:
    med = df_NY[c].median()          # Timestamp
    df_NY[c] = df_NY[c].fillna(med)  # fill NaT with the median
# Check missing values
for c in date_cols:
print("Missing values after imputation:", df_NY[c].isna().sum())
Missing values after imputation: 0
Missing values after imputation: 0
missing_values(df_NY)[missing["Total_perc"] > 0].Total_perc
name 0.005506
host_since 0.041297
host_verifications 0.041297
host_identity_verified 0.041297
bathrooms 0.068829
bedrooms 16.651065
review_scores_rating 30.700402
review_scores_accuracy 30.727933
review_scores_cleanliness 30.703155
review_scores_checkin 30.738946
review_scores_communication 30.716921
review_scores_location 30.747206
review_scores_value 30.744452
license 85.328451
reviews_per_month 30.700402
Name: Total_perc, dtype: float64
Clean the "review_" and "reviews_" columns and fill missing values¶
Missing values > 30%
Goal: Consistently clean all numeric review columns (e.g. review_scores_rating, review_scores_cleanliness, reviews_per_month), remove outliers, and robustly fill missing values with each column's median.
Column selection:
review_cols = [c for c in df_NY.columns if c.lower().startswith(("review_", "reviews_"))]
Approach:
- Type conversion: convert the selected columns to float.
- Obvious errors: negative values -> NaN.
- Outliers (high): values above the 99.5% quantile per column -> NaN.
- Outliers (low): values below the 0.5% quantile -> NaN.
- Check before imputation: inspect distributions and missing share per column.
- Imputation: fill missing values column-wise with the median.
- Check after imputation: document distributions and the share of imputed values; ensure no negative or out-of-range values remain.
# Check missing values
review_cols = [c for c in df_NY.columns if c.lower().startswith(("review_", "reviews_"))]
for c in review_cols:
print("Missing values before imputation:", df_NY[c].isna().sum())
Missing values before imputation: 11151
Missing values before imputation: 11161
Missing values before imputation: 11152
Missing values before imputation: 11165
Missing values before imputation: 11157
Missing values before imputation: 11168
Missing values before imputation: 11167
Missing values before imputation: 11151
# Review columns: obvious errors and extreme values -> NaN, then fill with the median
for c in review_cols:
    # convert to float
    s = pd.to_numeric(df_NY[c], errors="coerce")
    # obvious errors/outliers -> NaN
    s = s.mask(s < 0)  # negative values are not allowed
    upper_cap = s.quantile(0.995)  # cap very high spikes
    s = s.mask(s > upper_cap)
    # cap very low outliers
    lower_cap = s.quantile(0.005)
    s = s.mask(s < lower_cap)
    # fill missing values with the median
    df_NY[c] = s.fillna(s.median())
# Check missing values
for c in review_cols:
    print("Missing values after imputation:", df_NY[c].isna().sum())
Missing values after imputation: 0 Missing values after imputation: 0 Missing values after imputation: 0 Missing values after imputation: 0 Missing values after imputation: 0 Missing values after imputation: 0 Missing values after imputation: 0 Missing values after imputation: 0
Clean the remaining columns with missing values and fill them¶
- name: trim whitespace; missing -> "Unbenannt".
- host_since: future dates and values > the 99.5% quantile -> NaT; missing -> median date.
- host_verifications: missing -> "[]"; trim whitespace.
- host_identity_verified: missing -> False.
- bathrooms: grouped median by room_type × accommodates, otherwise the global median; clip to ≥ 0, round to 0.5 steps.
- bedrooms: grouped median, otherwise the global median; clip to ≥ 0.
- license: very high missing rate (85%); missing values are filled with a placeholder rather than dropping the column.
- Report: print the remaining missing values per column.
# Columns with missing values
missing_values(df_NY)[missing["Total_perc"] > 0].Total_perc
name 0.005506 host_since 0.041297 host_verifications 0.041297 host_identity_verified 0.041297 bathrooms 0.068829 bedrooms 16.651065 license 85.328451 Name: Total_perc, dtype: float64
# Column: "name"
df_NY["name"] = df_NY["name"].str.strip().fillna("Unbenannt")
# Column: "license" (very high missing rate: 85%) - fill with a placeholder
df_NY["license"] = df_NY["license"].str.strip().fillna("Unbenannt")
# Column: "host_since"
s = df_NY["host_since"]
s = s.mask(s > pd.Timestamp.today())  # future dates -> NaT
upper_cap = s.quantile(0.995)         # very late outliers
s = s.mask(s > upper_cap)
df_NY["host_since"] = s.fillna(s.median())  # global median of the column
# Column: "host_verifications"
df_NY["host_verifications"] = df_NY["host_verifications"].fillna("[]").str.strip()
# Column: "host_identity_verified"
df_NY["host_identity_verified"] = df_NY["host_identity_verified"].fillna(False)
# Column: "bathrooms"
grp_cols = [c for c in ["room_type", "accommodates"] if c in df_NY.columns]
if grp_cols:
    med_g = df_NY.groupby(grp_cols)["bathrooms"].transform("median")
    need = df_NY["bathrooms"].isna()
    df_NY.loc[need, "bathrooms"] = med_g[need]
# global fallback
df_NY["bathrooms"] = df_NY["bathrooms"].fillna(df_NY["bathrooms"].median())
# clean up
df_NY["bathrooms"] = df_NY["bathrooms"].clip(lower=0)
df_NY["bathrooms"] = np.round(df_NY["bathrooms"] * 2) / 2  # 0.5 steps
# Column: "bedrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_NY.columns]
if grp_cols:
med_g = df_NY.groupby(grp_cols)["bedrooms"].transform("median")
need = df_NY["bedrooms"].isna()
df_NY.loc[need, "bedrooms"] = med_g[need]
df_NY["bedrooms"] = df_NY["bedrooms"].fillna(df_NY["bedrooms"].median())
df_NY["bedrooms"] = df_NY["bedrooms"].clip(lower=0)
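The grouped-median imputation and the half-step rounding used above can be illustrated on a toy frame (synthetic values, not the listings data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "room_type": ["Entire", "Entire", "Private", "Private"],
    "bathrooms": [1.0, None, 2.8, None],
})
# fill each gap with the median of its room_type group
med = df.groupby("room_type")["bathrooms"].transform("median")
df["bathrooms"] = df["bathrooms"].fillna(med)
# round to the nearest 0.5 step: e.g. 2.8 -> 3.0
df["bathrooms"] = np.round(df["bathrooms"] * 2) / 2
```

The group-wise `transform("median")` keeps the original row alignment, so the fill can be done in a single vectorized `fillna`.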
# Check
cols_report = ["name","host_since","host_verifications","host_identity_verified","bathrooms","bedrooms","license"]
present = [c for c in cols_report if c in df_NY.columns]
print("Cleaned columns:", present)
print(df_NY[present].isna().sum().sort_values(ascending=False))
Cleaned columns: ['name', 'host_since', 'host_verifications', 'host_identity_verified', 'bathrooms', 'bedrooms', 'license'] name 0 host_since 0 host_verifications 0 host_identity_verified 0 bathrooms 0 bedrooms 0 license 0 dtype: int64
missing_values(df_NY)[missing["Total_perc"] > 0].Total_perc
Series([], Name: Total_perc, dtype: float64)
Feature Engineering¶
Create host since year¶
The procedure here is as follows: create the column "host_since_year" and extract the year from "host_since".
Goal: derive the year for simple groupings/trends.
Prerequisite: "host_since" is already datetime64.
df_NY["host_since"].head()
0 2008-09-07 1 2008-09-09 2 2009-02-03 3 2009-05-06 4 2009-05-07 Name: host_since, dtype: datetime64[ns]
# extract year from "host_since"
df_NY["host_since_year"] = df_NY["host_since"].dt.year.astype("int64")
df_NY["host_since_year"].head()
0 2008 1 2008 2 2009 3 2009 4 2009 Name: host_since_year, dtype: int64
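The `.astype("int64")` cast above only succeeds because all missing `host_since` values were imputed beforehand; if NaT entries remained, pandas' nullable `Int64` dtype would be needed instead. A minimal sketch on toy values:

```python
import pandas as pd

# a datetime series with one missing entry
s = pd.to_datetime(pd.Series(["2008-09-07", None]))
# nullable Int64 keeps the gap as <NA>; plain int64 would raise here
years = s.dt.year.astype("Int64")
```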
# Create a backup (save dataframe to csv file)
df_NY.to_csv(f"{file_NY[:-4]}_clean.csv")
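When reading such a backup back in, the saved index column and the datetime dtype have to be restored explicitly, since `to_csv` writes the index by default and CSV loses type information. A round-trip sketch on a toy frame (the `_clean.csv` path from the cell above would be used analogously):

```python
import io
import pandas as pd

df = pd.DataFrame({"host_since": pd.to_datetime(["2008-09-07", "2009-02-03"])})
buf = io.StringIO()
df.to_csv(buf)  # the index is written as the first, unnamed column
buf.seek(0)
# restore the index and re-parse the date column
df_back = pd.read_csv(buf, index_col=0, parse_dates=["host_since"])
```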
Data Understanding and Data Preparation - Rome¶
Data preparation: Detailed Listings for Rome, Lazio, Italy (2025)¶
First look at the data¶
To begin, a first impression of the data is gained:
- What does the dataset look like overall, which values are at the top, which at the bottom, and which data types dominate?
- The individual columns are also examined more closely, split into numeric and non-numeric columns, among other things.
- The numeric columns are then put under the microscope: how many are there, and what do the arithmetic mean, standard deviation, quantiles, and maximum values look like?
# Rome: read the dataset + first look at the data
file_Rome = "data/Rome/listings.csv"
data_Rome = pd.read_csv(file_Rome, low_memory=False)
# Look at the head of the data
data_Rome.head(2)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2737 | https://www.airbnb.com/rooms/2737 | 20250612050804 | 2025-06-26 | city scrape | Elif's room in cozy, clean flat. | 10 min by bus you can get to Piazza Venezia or... | It used to be an industrial area until late 80... | https://a0.muscache.com/pictures/41225252/e955... | 3047 | ... | 5.00 | 4.40 | 4.40 | NaN | f | 6 | 0 | 6 | 0 | 0.04 |
| 1 | 3079 | https://www.airbnb.com/rooms/3079 | 20250612050804 | 2025-06-27 | city scrape | Cozy apartment (2-4)with Colisseum view | With the view of the Colisseum from the front ... | Monti neighborhood is one of the best areas in... | https://a0.muscache.com/pictures/miso/Hosting-... | 3504 | ... | 4.86 | 4.81 | 4.43 | NaN | f | 6 | 6 | 0 | 0 | 0.12 |
2 rows × 79 columns
# Rome: more detailed info on columns and data types
data_Rome.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36309 entries, 0 to 36308 Data columns (total 79 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 36309 non-null int64 1 listing_url 36309 non-null object 2 scrape_id 36309 non-null int64 3 last_scraped 36309 non-null object 4 source 36309 non-null object 5 name 36309 non-null object 6 description 35330 non-null object 7 neighborhood_overview 17656 non-null object 8 picture_url 36309 non-null object 9 host_id 36309 non-null int64 10 host_url 36309 non-null object 11 host_name 36293 non-null object 12 host_since 36288 non-null object 13 host_location 26981 non-null object 14 host_about 16758 non-null object 15 host_response_time 30539 non-null object 16 host_response_rate 30539 non-null object 17 host_acceptance_rate 32737 non-null object 18 host_is_superhost 34304 non-null object 19 host_thumbnail_url 36288 non-null object 20 host_picture_url 36288 non-null object 21 host_neighbourhood 12168 non-null object 22 host_listings_count 36288 non-null float64 23 host_total_listings_count 36288 non-null float64 24 host_verifications 36288 non-null object 25 host_has_profile_pic 36288 non-null object 26 host_identity_verified 36288 non-null object 27 neighbourhood 17656 non-null object 28 neighbourhood_cleansed 36309 non-null object 29 neighbourhood_group_cleansed 0 non-null float64 30 latitude 36309 non-null float64 31 longitude 36309 non-null float64 32 property_type 36309 non-null object 33 room_type 36309 non-null object 34 accommodates 36309 non-null int64 35 bathrooms 32601 non-null float64 36 bathrooms_text 36275 non-null object 37 bedrooms 35532 non-null float64 38 beds 32583 non-null float64 39 amenities 36309 non-null object 40 price 32591 non-null object 41 minimum_nights 36309 non-null int64 42 maximum_nights 36309 non-null int64 43 minimum_minimum_nights 36299 non-null float64 44 maximum_minimum_nights 36299 non-null float64 45 minimum_maximum_nights 36299 non-null float64 46 
maximum_maximum_nights 36299 non-null float64 47 minimum_nights_avg_ntm 36309 non-null float64 48 maximum_nights_avg_ntm 36309 non-null float64 49 calendar_updated 0 non-null float64 50 has_availability 35718 non-null object 51 availability_30 36309 non-null int64 52 availability_60 36309 non-null int64 53 availability_90 36309 non-null int64 54 availability_365 36309 non-null int64 55 calendar_last_scraped 36309 non-null object 56 number_of_reviews 36309 non-null int64 57 number_of_reviews_ltm 36309 non-null int64 58 number_of_reviews_l30d 36309 non-null int64 59 availability_eoy 36309 non-null int64 60 number_of_reviews_ly 36309 non-null int64 61 estimated_occupancy_l365d 36309 non-null int64 62 estimated_revenue_l365d 32591 non-null float64 63 first_review 31523 non-null object 64 last_review 31523 non-null object 65 review_scores_rating 31523 non-null float64 66 review_scores_accuracy 31516 non-null float64 67 review_scores_cleanliness 31517 non-null float64 68 review_scores_checkin 31516 non-null float64 69 review_scores_communication 31517 non-null float64 70 review_scores_location 31515 non-null float64 71 review_scores_value 31516 non-null float64 72 license 33124 non-null object 73 instant_bookable 36309 non-null object 74 calculated_host_listings_count 36309 non-null int64 75 calculated_host_listings_count_entire_homes 36309 non-null int64 76 calculated_host_listings_count_private_rooms 36309 non-null int64 77 calculated_host_listings_count_shared_rooms 36309 non-null int64 78 reviews_per_month 31523 non-null float64 dtypes: float64(24), int64(20), object(35) memory usage: 21.9+ MB
# Look at the tail of the data
data_Rome.tail(2)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36307 | 1439417789689826311 | https://www.airbnb.com/rooms/1439417789689826311 | 20250612050804 | 2025-06-25 | city scrape | New Studio B - Best Deal x 1 person | Studio, cozy, renovated, perfect for a person ... | NaN | https://a0.muscache.com/pictures/hosting/Hosti... | 34005675 | ... | NaN | NaN | NaN | NaN | f | 12 | 8 | 4 | 0 | NaN |
| 36308 | 1439545861830252500 | https://www.airbnb.com/rooms/1439545861830252500 | 20250612050804 | 2025-06-26 | city scrape | Via dell Orso - Rome | Discover your perfect urban escape in the hear... | Located in the heart of Rome, this neighborhoo... | https://a0.muscache.com/pictures/prohost-api/H... | 597677415 | ... | NaN | NaN | NaN | IT058091C22E6TFIXA | t | 14 | 14 | 0 | 0 | NaN |
2 rows × 79 columns
# Overall shape of the dataset
data_Rome.shape
(36309, 79)
# Show all columns
data_Rome.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
'description', 'neighborhood_overview', 'picture_url', 'host_id',
'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
'host_response_time', 'host_response_rate', 'host_acceptance_rate',
'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
'host_neighbourhood', 'host_listings_count',
'host_total_listings_count', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
'maximum_minimum_nights', 'minimum_maximum_nights',
'maximum_maximum_nights', 'minimum_nights_avg_ntm',
'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'calendar_last_scraped', 'number_of_reviews',
'number_of_reviews_ltm', 'number_of_reviews_l30d', 'availability_eoy',
'number_of_reviews_ly', 'estimated_occupancy_l365d',
'estimated_revenue_l365d', 'first_review', 'last_review',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'license', 'instant_bookable',
'calculated_host_listings_count',
'calculated_host_listings_count_entire_homes',
'calculated_host_listings_count_private_rooms',
'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
dtype='object')
# Rome: split into numeric and non-numeric columns
# Numeric columns
data_numeric = data_Rome.select_dtypes(include=[np.number])
numeric_cols = data_numeric.columns.values
print("\n", "1 - Numeric columns: ", "\n")
print(numeric_cols)
# Non-numeric columns
data_non_numeric = data_Rome.select_dtypes(exclude=[np.number])
non_numeric_cols = data_non_numeric.columns.values
print("\n", "2 - Non-numeric columns: ", "\n")
print(non_numeric_cols)
1 - Numeric columns: ['id' 'scrape_id' 'host_id' 'host_listings_count' 'host_total_listings_count' 'neighbourhood_group_cleansed' 'latitude' 'longitude' 'accommodates' 'bathrooms' 'bedrooms' 'beds' 'minimum_nights' 'maximum_nights' 'minimum_minimum_nights' 'maximum_minimum_nights' 'minimum_maximum_nights' 'maximum_maximum_nights' 'minimum_nights_avg_ntm' 'maximum_nights_avg_ntm' 'calendar_updated' 'availability_30' 'availability_60' 'availability_90' 'availability_365' 'number_of_reviews' 'number_of_reviews_ltm' 'number_of_reviews_l30d' 'availability_eoy' 'number_of_reviews_ly' 'estimated_occupancy_l365d' 'estimated_revenue_l365d' 'review_scores_rating' 'review_scores_accuracy' 'review_scores_cleanliness' 'review_scores_checkin' 'review_scores_communication' 'review_scores_location' 'review_scores_value' 'calculated_host_listings_count' 'calculated_host_listings_count_entire_homes' 'calculated_host_listings_count_private_rooms' 'calculated_host_listings_count_shared_rooms' 'reviews_per_month'] 2 - Non-numeric columns: ['listing_url' 'last_scraped' 'source' 'name' 'description' 'neighborhood_overview' 'picture_url' 'host_url' 'host_name' 'host_since' 'host_location' 'host_about' 'host_response_time' 'host_response_rate' 'host_acceptance_rate' 'host_is_superhost' 'host_thumbnail_url' 'host_picture_url' 'host_neighbourhood' 'host_verifications' 'host_has_profile_pic' 'host_identity_verified' 'neighbourhood' 'neighbourhood_cleansed' 'property_type' 'room_type' 'bathrooms_text' 'amenities' 'price' 'has_availability' 'calendar_last_scraped' 'first_review' 'last_review' 'license' 'instant_bookable']
# Number of numeric columns
data_Rome.select_dtypes(include="number").shape[1]
44
# Number of categorical columns
data_Rome.select_dtypes(include=["object"]).shape[1]
35
# Rome: closer look at the numeric values
data_Rome.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 36309.0 | 7.091096e+17 | 5.642698e+17 | 2.737000e+03 | 3.641325e+07 | 9.080119e+17 | 1.221495e+18 | 1.440786e+18 |
| scrape_id | 36309.0 | 2.025061e+13 | 0.000000e+00 | 2.025061e+13 | 2.025061e+13 | 2.025061e+13 | 2.025061e+13 | 2.025061e+13 |
| host_id | 36309.0 | 2.532322e+08 | 2.383981e+08 | 1.822000e+03 | 2.845217e+07 | 1.627744e+08 | 4.899548e+08 | 7.004666e+08 |
| host_listings_count | 36288.0 | 1.639385e+01 | 7.222286e+01 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 7.000000e+00 | 1.845000e+03 |
| host_total_listings_count | 36288.0 | 2.466755e+01 | 1.352076e+02 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 9.000000e+00 | 8.724000e+03 |
| neighbourhood_group_cleansed | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| latitude | 36309.0 | 4.189135e+01 | 3.522517e-02 | 4.165582e+01 | 4.188332e+01 | 4.189634e+01 | 4.190670e+01 | 4.212131e+01 |
| longitude | 36309.0 | 1.248055e+01 | 4.957734e-02 | 1.224253e+01 | 1.245916e+01 | 1.247763e+01 | 1.250551e+01 | 1.283570e+01 |
| accommodates | 36309.0 | 3.820155e+00 | 1.987939e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 5.000000e+00 | 1.600000e+01 |
| bathrooms | 32601.0 | 1.309193e+00 | 6.803291e-01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.500000e+00 | 2.150000e+01 |
| bedrooms | 35532.0 | 1.509259e+00 | 8.989650e-01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.800000e+01 |
| beds | 32583.0 | 2.223706e+00 | 1.494002e+00 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+01 |
| minimum_nights | 36309.0 | 5.544851e+00 | 1.619246e+01 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 7.300000e+02 |
| maximum_nights | 36309.0 | 3.531390e+02 | 4.084050e+02 | 1.000000e+00 | 2.900000e+01 | 3.650000e+02 | 3.650000e+02 | 1.825000e+03 |
| minimum_minimum_nights | 36299.0 | 4.988347e+00 | 1.507299e+01 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 7.300000e+02 |
| maximum_minimum_nights | 36299.0 | 6.625417e+00 | 2.818776e+01 | 1.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.000000e+03 |
| minimum_maximum_nights | 36299.0 | 5.050061e+02 | 4.895995e+02 | 1.000000e+00 | 3.000000e+01 | 3.650000e+02 | 1.125000e+03 | 9.999000e+03 |
| maximum_maximum_nights | 36299.0 | 5.549658e+02 | 4.950607e+02 | 1.000000e+00 | 3.000000e+01 | 3.650000e+02 | 1.125000e+03 | 9.999000e+03 |
| minimum_nights_avg_ntm | 36309.0 | 5.518841e+00 | 1.701510e+01 | 1.000000e+00 | 1.300000e+00 | 2.000000e+00 | 3.000000e+00 | 9.194000e+02 |
| maximum_nights_avg_ntm | 36309.0 | 5.292944e+02 | 4.823069e+02 | 1.000000e+00 | 3.000000e+01 | 3.650000e+02 | 1.125000e+03 | 9.999000e+03 |
| calendar_updated | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| availability_30 | 36309.0 | 1.319563e+01 | 9.657643e+00 | 0.000000e+00 | 5.000000e+00 | 1.200000e+01 | 2.100000e+01 | 3.000000e+01 |
| availability_60 | 36309.0 | 3.075667e+01 | 1.886661e+01 | 0.000000e+00 | 1.500000e+01 | 3.100000e+01 | 4.700000e+01 | 6.000000e+01 |
| availability_90 | 36309.0 | 4.971379e+01 | 2.756184e+01 | 0.000000e+00 | 2.900000e+01 | 5.200000e+01 | 7.300000e+01 | 9.000000e+01 |
| availability_365 | 36309.0 | 2.138702e+02 | 1.129729e+02 | 0.000000e+00 | 1.240000e+02 | 2.350000e+02 | 3.160000e+02 | 3.650000e+02 |
| number_of_reviews | 36309.0 | 5.744204e+01 | 9.728829e+01 | 0.000000e+00 | 3.000000e+00 | 1.700000e+01 | 6.700000e+01 | 2.683000e+03 |
| number_of_reviews_ltm | 36309.0 | 1.436966e+01 | 1.888631e+01 | 0.000000e+00 | 1.000000e+00 | 6.000000e+00 | 2.200000e+01 | 5.090000e+02 |
| number_of_reviews_l30d | 36309.0 | 1.484370e+00 | 2.063618e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 2.500000e+01 |
| availability_eoy | 36309.0 | 1.166778e+02 | 5.555669e+01 | 0.000000e+00 | 8.300000e+01 | 1.290000e+02 | 1.620000e+02 | 1.900000e+02 |
| number_of_reviews_ly | 36309.0 | 1.328861e+01 | 1.950826e+01 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 2.100000e+01 | 5.210000e+02 |
| estimated_occupancy_l365d | 36309.0 | 8.180738e+01 | 9.086244e+01 | 0.000000e+00 | 6.000000e+00 | 4.200000e+01 | 1.440000e+02 | 2.550000e+02 |
| estimated_revenue_l365d | 32591.0 | 1.494125e+04 | 2.859316e+04 | 0.000000e+00 | 8.760000e+02 | 6.216000e+03 | 1.989000e+04 | 1.242870e+06 |
| review_scores_rating | 31523.0 | 4.769352e+00 | 3.278882e-01 | 1.000000e+00 | 4.690000e+00 | 4.860000e+00 | 4.980000e+00 | 5.000000e+00 |
| review_scores_accuracy | 31516.0 | 4.805095e+00 | 3.129784e-01 | 1.000000e+00 | 4.750000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_cleanliness | 31517.0 | 4.784912e+00 | 3.223713e-01 | 1.000000e+00 | 4.710000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_checkin | 31516.0 | 4.845727e+00 | 2.879957e-01 | 1.000000e+00 | 4.810000e+00 | 4.920000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_communication | 31517.0 | 4.858408e+00 | 2.946084e-01 | 1.000000e+00 | 4.840000e+00 | 4.950000e+00 | 5.000000e+00 | 5.000000e+00 |
| review_scores_location | 31515.0 | 4.754153e+00 | 3.169800e-01 | 1.000000e+00 | 4.650000e+00 | 4.840000e+00 | 4.970000e+00 | 5.000000e+00 |
| review_scores_value | 31516.0 | 4.684645e+00 | 3.617441e-01 | 1.000000e+00 | 4.600000e+00 | 4.760000e+00 | 4.890000e+00 | 5.000000e+00 |
| calculated_host_listings_count | 36309.0 | 9.320857e+00 | 2.734508e+01 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 6.000000e+00 | 2.440000e+02 |
| calculated_host_listings_count_entire_homes | 36309.0 | 7.774216e+00 | 2.703961e+01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 2.440000e+02 |
| calculated_host_listings_count_private_rooms | 36309.0 | 1.455892e+00 | 3.547885e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 4.300000e+01 |
| calculated_host_listings_count_shared_rooms | 36309.0 | 2.277672e-02 | 5.159183e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.800000e+01 |
| reviews_per_month | 31523.0 | 1.806902e+00 | 1.642524e+00 | 1.000000e-02 | 4.800000e-01 | 1.340000e+00 | 2.750000e+00 | 3.973000e+01 |
# Rome: describe of column "bathrooms"
data_Rome["bathrooms"].describe()
count 32601.000000 mean 1.309193 std 0.680329 min 0.000000 25% 1.000000 50% 1.000000 75% 1.500000 max 21.500000 Name: bathrooms, dtype: float64
# Rome: describe of column "availability_365"
data_Rome["availability_365"].describe()
count 36309.000000 mean 213.870197 std 112.972871 min 0.000000 25% 124.000000 50% 235.000000 75% 316.000000 max 365.000000 Name: availability_365, dtype: float64
# Rome: describe of column "host_since"
data_Rome["host_since"].describe()
count 36288 unique 4852 top 2014-11-09 freq 253 Name: host_since, dtype: object
# Convert date types: object -> datetime
if "host_since" in data_Rome.columns:
data_Rome["host_since"] = pd.to_datetime(data_Rome["host_since"], errors="coerce")
data_Rome["host_since"].describe()
count 36288 mean 2018-08-24 00:31:01.904761856 min 2008-08-12 00:00:00 25% 2015-03-03 00:00:00 50% 2017-12-17 00:00:00 75% 2022-12-01 06:00:00 max 2025-06-10 00:00:00 Name: host_since, dtype: object
Second look - examples: availability, amenities, host since¶
# A look at the data
# Read the Rome amenities ("amenities") of a specific row via .loc
data_Rome["amenities"].loc[3010]
'["Fire extinguisher", "Kitchen", "Wifi", "Washer", "First aid kit", "Essentials", "Heating", "TV"]'
# A look at the data
# Complete record for row 3010, whose amenities are shown above
data_Rome.loc[3010]
id 8677467
listing_url https://www.airbnb.com/rooms/8677467
scrape_id 20250612050804
last_scraped 2025-07-02
source previous scrape
...
calculated_host_listings_count 1
calculated_host_listings_count_entire_homes 1
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
reviews_per_month NaN
Name: 3010, Length: 79, dtype: object
# A look at the data
# "host_since" - originally of type object, converted to datetime above
data_Rome["host_since"].head(1)
0 2008-09-18 Name: host_since, dtype: datetime64[ns]
# A look at the data
if "host_since" in data_Rome.columns:
data_Rome["host_since"] = pd.to_datetime(data_Rome["host_since"], errors="coerce")
data_Rome["host_since"].head(10)
0 2008-09-18 1 2008-10-08 2 2009-10-09 3 2009-03-30 4 2010-01-30 5 2010-01-30 6 2010-02-04 7 2010-02-04 8 2011-07-03 9 2011-07-05 Name: host_since, dtype: datetime64[ns]
# A look at the data
# Hosts since "2016-05-20"
data_Rome.loc[data_Rome["host_since"]=="2016-05-20"][["name", "host_since"]]
| name | host_since | |
|---|---|---|
| 4100 | Little Queen Guest House | 2016-05-20 |
| 6203 | Olimpico appartament | 2016-05-20 |
| 7154 | Appartamento A Roma Aquila 2 | 2016-05-20 |
| 7458 | Casa Vacanze Aquila 3 | 2016-05-20 |
| 7953 | Blancmaisonteveresuite | 2016-05-20 |
| 11431 | Appartamento fronte mare a Roma. | 2016-05-20 |
| 13827 | Appartamento Largo Arenula | 2016-05-20 |
| 15173 | Studio Appia | 2016-05-20 |
| 17873 | Blancmaisonvaticano | 2016-05-20 |
| 22787 | Rooms Sapienza | 2016-05-20 |
| 32576 | Cozy apartment Rome Eur | 2016-05-20 |
# A look at the data
# Top 5 highest availability (availability_365)
data_5first_availability_365 = data_Rome.sort_values(by=["availability_365"], ascending=False).head(5)
data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood"]]
| name | availability_365 | host_since | neighbourhood | |
|---|---|---|---|---|
| 0 | Elif's room in cozy, clean flat. | 365 | 2008-09-18 | Rome, Lazio, Italy |
| 20491 | suite 1 | 365 | 2015-07-31 | Rome, Lazio, Italy |
| 19011 | Camera Singola in Villa con Piscina | 365 | 2023-07-27 | NaN |
| 19060 | A.Livia Navona | 365 | 2017-11-07 | NaN |
| 19111 | Casa elegante centro Roma | 365 | 2016-04-07 | Rome, Lazio, Italy |
# A look at the data
# Bottom 5: lowest availability (availability_365)
data_5last_availability_365 = data_Rome.sort_values(by=["availability_365"], ascending=False).tail(5)
data_5last_availability_365[["name", "availability_365", "host_since", "neighbourhood"]]
| name | availability_365 | host_since | neighbourhood | |
|---|---|---|---|---|
| 9311 | Best price | 0 | 2019-09-01 | NaN |
| 9310 | best price | 0 | 2019-09-01 | NaN |
| 26331 | Comfort vicino al Vaticano, ROOM 404 | 0 | 2024-07-06 | NaN |
| 26327 | Trastevere Apartment 23 | 0 | 2024-07-08 | NaN |
| 2054 | Corso 12 Luxury Rooms & Suites - Piazza Di Spagna | 0 | 2015-01-20 | Rome, Lazio, Italy |
# A look at the data
# Top 5 "availability_365" with style.bar
data_5first_availability_365[["name", "availability_365", "host_since", "neighbourhood"]].style.bar(subset=["availability_365"])
| name | availability_365 | host_since | neighbourhood | |
|---|---|---|---|---|
| 0 | Elif's room in cozy, clean flat. | 365 | 2008-09-18 00:00:00 | Rome, Lazio, Italy |
| 20491 | suite 1 | 365 | 2015-07-31 00:00:00 | Rome, Lazio, Italy |
| 19011 | Camera Singola in Villa con Piscina | 365 | 2023-07-27 00:00:00 | nan |
| 19060 | A.Livia Navona | 365 | 2017-11-07 00:00:00 | nan |
| 19111 | Casa elegante centro Roma | 365 | 2016-04-07 00:00:00 | Rome, Lazio, Italy |
# A look at the data
# Check the sum of number_of_reviews
print("number_of_reviews: ", data_Rome["number_of_reviews"].sum())
# Check the number of unique property_type values
print("property_type: ", len(data_Rome["property_type"].unique()))
# Check the number of unique neighbourhood_group_cleansed values
print("neighbourhood_group_cleansed: ", len(data_Rome["neighbourhood_group_cleansed"].unique()))
number_of_reviews: 2085663 property_type: 64 neighbourhood_group_cleansed: 1
# Investigate Room Type
data_Rome["room_type"].unique()
array(['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'],
dtype=object)
Descriptive statistics:¶
In this part we turn to the following core topics:
- Dropping columns containing only URLs, images, host name, host ID, host location, and description
- Converting the "date" column and the boolean columns
- Handling null values
# Drop columns containing only URLs, images, host name, host ID, host location, description
# Also drop the columns carrying info about the last scrape, as well as:
# calculated_host_listings_count
# calculated_host_listings_count_entire_homes
# calculated_host_listings_count_private_rooms
# calculated_host_listings_count_shared_rooms
# Column "host_name" - name of the host, usually just the first name(s)
# Columns "host_listings_count", "host_total_listings_count": the number of listings the host has (per Airbnb, unknown calculations)
# Columns not needed for the analysis:
drop_columns = ["listing_url",
"scrape_id",
"last_scraped",
"source",
"description",
"neighborhood_overview",
"picture_url",
"host_url",
"host_name",
"host_about",
"host_thumbnail_url",
"host_picture_url",
"host_listings_count",
"host_total_listings_count",
"host_has_profile_pic",
"calendar_last_scraped",
"calculated_host_listings_count",
"calculated_host_listings_count_entire_homes",
"calculated_host_listings_count_private_rooms",
"calculated_host_listings_count_shared_rooms"
]
df_Rome = data_Rome.drop(columns=drop_columns, axis=1, errors="ignore")
print(df_Rome[["host_identity_verified","instant_bookable","has_availability"]].head())
# Convert boolean types
def to_bool(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip().lower()
    if s in ("t", "true"):
        return True
    if s in ("f", "false"):
        return False
    return pd.NA
col_bool = ["host_identity_verified", "instant_bookable", "has_availability"]
for c in col_bool:
    if c in df_Rome.columns:
        df_Rome[c] = df_Rome[c].map(to_bool).astype("boolean")
host_identity_verified instant_bookable has_availability 0 t f t 1 t f t 2 t f t 3 t f t 4 t t t
print(df_Rome[["host_identity_verified","instant_bookable","has_availability"]].head())
host_identity_verified instant_bookable has_availability 0 True False True 1 True False True 2 True False True 3 True False True 4 True True True
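An equivalent, more compact variant maps the two string codes directly and leaves everything else as missing; a sketch on toy values (not the production conversion above):

```python
import pandas as pd

s = pd.Series(["t", "f", None, "x"])
# unmapped values become NaN; astype("boolean") turns them into <NA>
out = s.map({"t": True, "f": False}).astype("boolean")
```

The nullable `boolean` dtype is what allows missing values to coexist with True/False in one column.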
# Column "license" - values [nan, 'Exempt', "IT*"]
# Do not drop column "license"
[i for i in df_Rome["license"].unique() if str(i)[:2].upper() != "IT"]
[nan, 'I applied for a CIN, but haven’t received it yet', 'Exempt']
# Check for missing values
df_Rome.isna().any()
id False name False host_id False host_since True host_location True host_response_time True host_response_rate True host_acceptance_rate True host_is_superhost True host_neighbourhood True host_verifications True host_identity_verified True neighbourhood True neighbourhood_cleansed False neighbourhood_group_cleansed True latitude False longitude False property_type False room_type False accommodates False bathrooms True bathrooms_text True bedrooms True beds True amenities False price True minimum_nights False maximum_nights False minimum_minimum_nights True maximum_minimum_nights True minimum_maximum_nights True maximum_maximum_nights True minimum_nights_avg_ntm False maximum_nights_avg_ntm False calendar_updated True has_availability True availability_30 False availability_60 False availability_90 False availability_365 False number_of_reviews False number_of_reviews_ltm False number_of_reviews_l30d False availability_eoy False number_of_reviews_ly False estimated_occupancy_l365d False estimated_revenue_l365d True first_review True last_review True review_scores_rating True review_scores_accuracy True review_scores_cleanliness True review_scores_checkin True review_scores_communication True review_scores_location True review_scores_value True license True instant_bookable False reviews_per_month True dtype: bool
# Helper functions
# Missing values
missing = df_Rome.isnull().sum().to_frame()
def missing_values(df):
    missing["Missing_values"] = df.isnull().sum()
    missing["Total_perc"] = (missing["Missing_values"] / len(df)) * 100
    return missing
missing_values(df_Rome)[missing["Total_perc"] > 0].Total_perc
host_since 0.057837 host_location 25.690600 host_response_time 15.891377 host_response_rate 15.891377 host_acceptance_rate 9.837781 host_is_superhost 5.522047 host_neighbourhood 66.487648 host_verifications 0.057837 host_identity_verified 0.057837 neighbourhood 51.372938 neighbourhood_group_cleansed 100.000000 bathrooms 10.212344 bathrooms_text 0.093641 bedrooms 2.139965 beds 10.261919 price 10.239885 minimum_minimum_nights 0.027541 maximum_minimum_nights 0.027541 minimum_maximum_nights 0.027541 maximum_maximum_nights 0.027541 calendar_updated 100.000000 has_availability 1.627696 estimated_revenue_l365d 10.239885 first_review 13.181305 last_review 13.181305 review_scores_rating 13.181305 review_scores_accuracy 13.200584 review_scores_cleanliness 13.197830 review_scores_checkin 13.200584 review_scores_communication 13.197830 review_scores_location 13.203338 review_scores_value 13.200584 license 8.771930 reviews_per_month 13.181305 Name: Total_perc, dtype: float64
# Spalte: "calendar_updated" Missing-Rate 100%
df_Rome["calendar_updated"].unique()
# Spalte "neighbourhood_group_cleansed": Missing-Rate 100%
df_Rome["neighbourhood_group_cleansed"].unique()
# Hohe Missing-Rate bzw. keine Beschreibung im Data Dictionary:
# host_is_superhost
# host_response_time
# host_response_rate
array([nan])
df_Rome["neighbourhood_group_cleansed"] = df_Rome["neighbourhood"]
drop_columns = ["host_response_time",
"host_response_rate",
"host_is_superhost",
"neighbourhood",
"calendar_updated",
]
df_Rome = df_Rome.drop(columns=drop_columns, errors="ignore")
# print(df_Rome.info())
# Median
df_Rome.median(numeric_only=True)
id 908011910692434944.0 host_id 162774411.0 host_identity_verified 1.0 latitude 41.896342 longitude 12.47763 accommodates 4.0 bathrooms 1.0 bedrooms 1.0 beds 2.0 minimum_nights 2.0 maximum_nights 365.0 minimum_minimum_nights 2.0 maximum_minimum_nights 2.0 minimum_maximum_nights 365.0 maximum_maximum_nights 365.0 minimum_nights_avg_ntm 2.0 maximum_nights_avg_ntm 365.0 has_availability 1.0 availability_30 12.0 availability_60 31.0 availability_90 52.0 availability_365 235.0 number_of_reviews 17.0 number_of_reviews_ltm 6.0 number_of_reviews_l30d 0.0 availability_eoy 129.0 number_of_reviews_ly 3.0 estimated_occupancy_l365d 42.0 estimated_revenue_l365d 6216.0 review_scores_rating 4.86 review_scores_accuracy 4.89 review_scores_cleanliness 4.88 review_scores_checkin 4.92 review_scores_communication 4.95 review_scores_location 4.84 review_scores_value 4.76 instant_bookable 1.0 reviews_per_month 1.34 dtype: Float64
# Mean of "accommodates"
df_Rome["accommodates"].mean().round(2)
np.float64(3.82)
# Mean of "minimum_nights" and "maximum_nights"
df_Rome["minimum_nights"].mean().round(2), df_Rome["maximum_nights"].mean().round(2)
(np.float64(5.54), np.float64(353.14))
IQR (Interquartile Range)¶
num = df_Rome.select_dtypes(include=[np.floating, np.integer])
num.quantile(0.75) - num.quantile(0.25)
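Der IQR lässt sich auch zur Ausreißer-Erkennung nutzen (Tukey-Zäune). Eine kleine Skizze; die Funktion iqr_bounds ist hier frei benannt und nicht Teil des Notebooks:

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Tukey-Zäune: Werte außerhalb [Q1 - k*IQR, Q3 + k*IQR] gelten als Ausreißer."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Beispiel: ein offensichtlicher Ausreißer
s = pd.Series([1, 2, 3, 4, 100])
lo, hi = iqr_bounds(s)
outliers = s[(s < lo) | (s > hi)]
```

Mit k=1.5 ergibt sich hier das Intervall [-1, 7]; nur der Wert 100 fällt heraus.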
df_Rome.describe(include=['object', 'category'])
| name | host_location | host_acceptance_rate | host_neighbourhood | host_verifications | neighbourhood_cleansed | neighbourhood_group_cleansed | property_type | room_type | bathrooms_text | amenities | price | first_review | last_review | license | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36309 | 26981 | 32737 | 12168 | 36288 | 36309 | 17656 | 36309 | 36309 | 36275 | 36309 | 32591 | 31523 | 31523 | 33124 |
| unique | 35376 | 559 | 99 | 118 | 6 | 15 | 108 | 64 | 4 | 41 | 34807 | 1079 | 4173 | 1777 | 25919 |
| top | Vacanze Romane | Rome, Italy | 100% | Prati | ['email', 'phone'] | I Centro Storico | Rome, Lazio, Italy | Entire rental unit | Entire home/apt | 1 bath | ["Kitchen", "TV", "Washer", "Wifi", "Air condi... | $90.00 | 2025-01-02 | 2025-06-22 | I applied for a CIN, but haven’t received it yet |
| freq | 16 | 24087 | 21186 | 1402 | 31342 | 17889 | 16951 | 18132 | 27345 | 19693 | 42 | 424 | 142 | 1192 | 378 |
Fehlende Daten - Null-Werte¶
Mithilfe von isnull().sum() erkennen wir, dass fast alle Spalten Null-Werte enthalten:
df_Rome.isnull().sum()
id 0 name 0 host_id 0 host_since 21 host_location 9328 host_acceptance_rate 3572 host_neighbourhood 24141 host_verifications 21 host_identity_verified 21 neighbourhood_cleansed 0 neighbourhood_group_cleansed 18653 latitude 0 longitude 0 property_type 0 room_type 0 accommodates 0 bathrooms 3708 bathrooms_text 34 bedrooms 777 beds 3726 amenities 0 price 3718 minimum_nights 0 maximum_nights 0 minimum_minimum_nights 10 maximum_minimum_nights 10 minimum_maximum_nights 10 maximum_maximum_nights 10 minimum_nights_avg_ntm 0 maximum_nights_avg_ntm 0 has_availability 591 availability_30 0 availability_60 0 availability_90 0 availability_365 0 number_of_reviews 0 number_of_reviews_ltm 0 number_of_reviews_l30d 0 availability_eoy 0 number_of_reviews_ly 0 estimated_occupancy_l365d 0 estimated_revenue_l365d 3718 first_review 4786 last_review 4786 review_scores_rating 4786 review_scores_accuracy 4793 review_scores_cleanliness 4792 review_scores_checkin 4793 review_scores_communication 4792 review_scores_location 4794 review_scores_value 4793 license 3185 instant_bookable 0 reviews_per_month 4786 dtype: int64
# Anzahl unterschiedlicher Zeilen: 36309
# Keine Duplikate
df_Rome.drop_duplicates().shape[0]
36309
Spalten Price, Bathrooms, has_availability säubern¶
Spalte "price" bereinigen und fehlende Werte ausfüllen¶
Fehlende Werte 10.0%
Ziel: Die Spalte price sauber in Float konvertieren, Ausreißer entfernen und fehlende Werte per hierarchischem Median imputieren.
Vorgehen:
- Preis säubern: in Float umwandeln; Währungszeichen und Tausendertrennzeichen entfernen.
- Unplausible Werte: price <= 0 oder sehr hoch (z. B. > 99,5%-Quantil) → NaN.
- Check vor Imputation: Verteilungen/Kennzahlen vor dem Füllen prüfen.
- Imputation: hierarchischer Median.
- Check nach Imputation: Verteilungen/Kennzahlen nach dem Füllen prüfen.
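Die Schritte "Unplausible Werte" und "hierarchischer Median" lassen sich z. B. so skizzieren. Funktionsname clean_price und die Gruppierungsspalten room_type/neighbourhood_cleansed sind hier Annahmen zur Illustration; das Notebook selbst füllt unten mit dem globalen Median:

```python
import numpy as np
import pandas as pd

def clean_price(df, group_cols=("room_type", "neighbourhood_cleansed"), q=0.995):
    """Preise kappen und fehlende Werte hierarchisch mit Gruppen-Medianen füllen."""
    out = df.copy()
    # Unplausible Werte -> NaN (<= 0 oder oberhalb des q-Quantils)
    cap = out["price"].quantile(q)
    out.loc[(out["price"] <= 0) | (out["price"] > cap), "price"] = np.nan
    # Hierarchie: erst feinste Gruppe, dann gröbere, zuletzt globaler Median
    for i in range(len(group_cols), 0, -1):
        med = out.groupby(list(group_cols[:i]))["price"].transform("median")
        out["price"] = out["price"].fillna(med)
    out["price"] = out["price"].fillna(out["price"].median())
    return out
```

So bleibt die Imputation lokal plausibel (gleicher Zimmertyp, gleiches Viertel) und fällt nur dort auf den globalen Median zurück, wo eine Gruppe keine gültigen Preise hat.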
print("Missing price before imputation:", df_Rome["price"].isna().sum())
Missing price before imputation: 3718
# Investigate "price" column
df_Rome["price"].loc[20:26]
def find_currency(df):
    """Erstes Zeichen jedes Preis-Strings sammeln (NaN überspringen)."""
    return {str(v)[0] for v in df["price"].dropna().unique()}
find_currency(df_Rome)
{'$'}
# Preis säubern: object -> float
df_Rome["price"] = (df_Rome["price"].astype(str)
.str.replace(r"[^\d.\-]", "", regex=True)
.replace({"": np.nan}))
df_Rome["price"] = pd.to_numeric(df_Rome["price"], errors="coerce")
# Fallback: globaler Median
if df_Rome["price"].isna().any():
df_Rome["price"] = df_Rome["price"].fillna(df_Rome["price"].median())
# Check
print("Missing price after imputation:", df_Rome["price"].isna().sum())
Missing price after imputation: 0
Spalte "bathrooms" bereinigen und fehlende Werte aus "bathrooms_text" füllen¶
Fehlende Werte 10.0%
Ziel: Die numerische Spalte bathrooms (float) vervollständigen, indem wir fehlende Werte aus der textuellen Spalte bathrooms_text ableiten. Beide Felder beschreiben dieselbe Information.
Vorgehen:
- Parsing-Regeln:
  - Zahl direkt vor bath/bathrooms (z. B. 1 bath, 1.5 baths, 2 bathrooms, 0 baths).
  - half-bath ohne Zahl entspricht 0.5.
  - Zusätze wie private/shared werden ignoriert.
- Füllen: Nur dort ergänzen, wo bathrooms NaN ist – existierende numerische Werte werden nicht überschrieben.
- Check: Wie viele Zeilen gefüllt, wie viele fehlen noch.
df_Rome[df_Rome["bathrooms"].notna()][["bathrooms", "bathrooms_text"]].head(10)
| bathrooms | bathrooms_text | |
|---|---|---|
| 0 | 1.5 | 1.5 baths |
| 1 | 1.0 | 1 bath |
| 2 | 1.0 | 1 bath |
| 3 | 1.0 | 1 bath |
| 4 | 1.0 | 1 bath |
| 5 | 1.0 | 1 bath |
| 6 | 1.0 | 1 bath |
| 7 | 1.0 | 1 bath |
| 8 | 2.0 | 2 shared baths |
| 9 | 1.0 | 1 bath |
# Investigate
df_Rome["bathrooms"].unique()
array([ 1.5, 1. , 2. , nan, 4. , 3. , 5. , 6.5, 2.5, 4.5, 0. ,
5.5, 6. , 12.5, 7. , 8. , 3.5, 0.5, 7.5, 12. , 9.5, 21.5,
10. , 9. , 11. ])
# Investigate
df_Rome["bathrooms_text"].unique()
array(['1.5 baths', '1 bath', '2 shared baths', '1 private bath',
'2 baths', '4 baths', '3 baths', '5 baths', '6.5 baths',
'3 shared baths', '2.5 baths', '1 shared bath', '4.5 baths',
'0 shared baths', 'Half-bath', nan, '12.5 baths', '5.5 baths',
'1.5 shared baths', '6 baths', '0 baths', '7 baths',
'8 shared baths', '3.5 baths', 'Shared half-bath',
'6 shared baths', '2.5 shared baths', '8 baths', '7.5 baths',
'12 baths', '7 shared baths', '9.5 baths', '21.5 baths',
'4 shared baths', '10 shared baths', '5 shared baths', '9 baths',
'Private half-bath', '10 baths', '24 baths', '7.5 shared baths',
'11 baths'], dtype=object)
# Investigate: "bathrooms", "bathrooms_text"
df_bathrooms_null_bathrooms_text = df_Rome[df_Rome["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]
df_bathrooms_null_bathrooms_text.head(10)
# df_bathrooms_null_bathrooms_text.index.values
idx = df_bathrooms_null_bathrooms_text.index.values
# fill df_Rome["bathrooms"] from df_Rome["bathrooms_text"] only where it’s missing
# (handles “1 bath”, “1.5 baths”, “Half-bath”, “shared half-bath”, “2 bathrooms”, etc.):
pattern = re.compile(r"""(?ix)
^\s* # allow leading spaces
(?: # start alternation
(?P<num>\d+(?:\.\d+)?)\s* # 1, 1.5, 0, 15.5
(?: (?:private|shared)\s+ )? # optional qualifier after number
bath(?:room)?s? # bath / bathroom(s)
|
(?: (?:private|shared)\s+ )? # optional qualifier before 'half'
half[-\s]?bath(?:room)?s? # half-bath / half bathroom(s)
)
\s*$ # allow trailing spaces
""")
def parse_bath(text: str):
"""Parse bath count from bathrooms_text (formats listed above)."""
if not isinstance(text, str) or not text.strip():
return np.nan
m = pattern.match(text.strip())
if not m:
return np.nan
if m.group('num') is not None: # numeric forms: '1 bath', '1.5 shared baths', '0 baths'
return float(m.group('num'))
else: # 'half-bath', 'shared half-bath', 'private half-bath'
return 0.5
# fill only missing bathrooms from parsed bathrooms_text
df_Rome["bathrooms"] = df_Rome["bathrooms"].fillna(df_Rome["bathrooms_text"].apply(parse_bath))
# Missing values
df_Rome["bathrooms"].isna().sum()
np.int64(21)
df_Rome["bathrooms"].unique()
array([ 1.5, 1. , 2. , 4. , 3. , 5. , 6.5, 2.5, 4.5, 0. , 0.5,
12.5, 5.5, 6. , 7. , 8. , 3.5, nan, 7.5, 12. , 9.5, 21.5,
10. , 9. , 24. , 11. ])
df_bathrooms_null_bathrooms_text_filled = df_Rome[df_Rome["bathrooms"].isnull()][["bathrooms", "bathrooms_text"]]
len(df_bathrooms_null_bathrooms_text_filled)
21
df_bathrooms_null_bathrooms_text_filled.index.values
array([ 2229, 4822, 4854, 5303, 5482, 7216, 7636, 8374, 8716,
8768, 8929, 18185, 20526, 20528, 21166, 22192, 32461, 33110,
33699, 33980, 36113])
df_Rome = df_Rome.drop("bathrooms_text", axis=1, errors="ignore")
Spalte "host_neighbourhood" und "host_location" untersuchen¶
Fehlende Werte: 66% bzw. 25%
Ziel: Die Spalten host_neighbourhood und host_location untersuchen und ggf. entfernen.
Vorgehen:
- host_neighbourhood mit neighbourhood_cleansed vergleichen.
- host_neighbourhood droppen.
- host_location mit latitude und longitude vergleichen.
- host_location droppen.
# Investigate: no description in the data Dictionary
df_Rome["host_neighbourhood"].unique()[:10]
array(['Testaccio', 'Monti', 'Flaminio', 'Prati', 'Trastevere',
'San Lorenzo', 'Esquilino', nan, 'Pigneto', 'Aurelio'],
dtype=object)
df_Rome[df_Rome["host_neighbourhood"].notnull()][["host_neighbourhood", "neighbourhood_cleansed"]].head(10)
| host_neighbourhood | neighbourhood_cleansed | |
|---|---|---|
| 0 | Testaccio | VIII Appia Antica |
| 1 | Monti | I Centro Storico |
| 2 | Monti | I Centro Storico |
| 3 | Flaminio | II Parioli/Nomentano |
| 4 | Prati | I Centro Storico |
| 5 | Prati | I Centro Storico |
| 6 | Trastevere | I Centro Storico |
| 7 | Trastevere | I Centro Storico |
| 8 | San Lorenzo | II Parioli/Nomentano |
| 9 | Monti | I Centro Storico |
len(df_Rome[df_Rome["host_neighbourhood"] == df_Rome["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
0
len(df_Rome[df_Rome["host_neighbourhood"] != df_Rome["neighbourhood_cleansed"]][["host_neighbourhood", "neighbourhood_cleansed"]])
36309
# Drop "host_neighbourhood". The reason: we have a column "neighbourhood_cleansed" with correct location
df_Rome = df_Rome.drop("host_neighbourhood", axis=1, errors="ignore")
# Description in the data Dictionary: The host's self reported location
df_Rome["host_location"].unique()[:10]
array(['Rome, Italy', nan, 'Milan, Italy', 'Vienna, Austria',
'Cerveteri, Italy', 'Alella, Spain', 'Ladispoli, Italy',
'London, United Kingdom', 'Giulianova, Italy', 'United Kingdom'],
dtype=object)
# Compare the column "host_location" with "latitude" and "longitude"
df_Rome[df_Rome["host_location"].notnull()][["host_location", "latitude", "longitude"]].head(10)
| host_location | latitude | longitude | |
|---|---|---|---|
| 0 | Rome, Italy | 41.871360 | 12.482150 |
| 1 | Rome, Italy | 41.895000 | 12.491170 |
| 2 | Rome, Italy | 41.895447 | 12.491181 |
| 3 | Rome, Italy | 41.925820 | 12.469280 |
| 4 | Rome, Italy | 41.907310 | 12.453500 |
| 5 | Rome, Italy | 41.909200 | 12.453440 |
| 6 | Rome, Italy | 41.889920 | 12.468230 |
| 7 | Rome, Italy | 41.889920 | 12.468230 |
| 8 | Rome, Italy | 41.897860 | 12.514600 |
| 9 | Rome, Italy | 41.888360 | 12.501480 |
# Drop "host_location": überwiegend redundante Angaben (meist "Rome, Italy"),
# der Standort liegt bereits über "latitude"/"longitude" vor
df_Rome = df_Rome.drop("host_location", axis=1, errors="ignore")
Spalte "has_availability" untersuchen und fehlende Werte aus "availability_365" füllen¶
Fehlende Werte 1.6%
Ziel: Die boolesche Spalte has_availability vervollständigen, indem wir fehlende Werte aus der Spalte availability_365 ableiten.
Vorgehen:
- Füllen: Nur dort ergänzen, wo has_availability NaN ist – existierende boolesche Werte werden nicht überschrieben.
- Check: Wie viele Zeilen gefüllt, wie viele fehlen noch.
print("Missing \"has_availability\" before:", df_Rome["has_availability"].isna().sum())
Missing "has_availability" before: 591
# Investigate "has_availability"
df_Rome["has_availability"].unique()
# [True, <NA>]
df_Rome[df_Rome["has_availability"].isnull()][["has_availability", "availability_30", "availability_60", "availability_90", "availability_365"]]
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
|---|---|---|---|---|---|
| 246 | <NA> | 0 | 0 | 0 | 0 |
| 248 | <NA> | 0 | 0 | 0 | 0 |
| 321 | <NA> | 0 | 0 | 0 | 0 |
| 354 | <NA> | 0 | 0 | 0 | 0 |
| 386 | <NA> | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 36091 | <NA> | 15 | 16 | 26 | 281 |
| 36103 | <NA> | 30 | 60 | 90 | 365 |
| 36263 | <NA> | 19 | 49 | 76 | 351 |
| 36269 | <NA> | 29 | 59 | 89 | 364 |
| 36284 | <NA> | 25 | 55 | 85 | 175 |
591 rows × 5 columns
# Investigate "has_availability"
mask = (
df_Rome["availability_30"].eq(0) &
df_Rome["availability_60"].eq(0) &
df_Rome["availability_90"].eq(0) &
df_Rome["availability_365"].eq(0) &
df_Rome["has_availability"].isna()
)
count = mask.sum() # number of rows
rows = df_Rome.loc[mask] # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
|---|---|---|---|---|---|
| 246 | <NA> | 0 | 0 | 0 | 0 |
| 248 | <NA> | 0 | 0 | 0 | 0 |
| 321 | <NA> | 0 | 0 | 0 | 0 |
| 354 | <NA> | 0 | 0 | 0 | 0 |
| 386 | <NA> | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 32065 | <NA> | 0 | 0 | 0 | 0 |
| 32220 | <NA> | 0 | 0 | 0 | 0 |
| 32236 | <NA> | 0 | 0 | 0 | 0 |
| 33791 | <NA> | 0 | 0 | 0 | 0 |
| 35726 | <NA> | 0 | 0 | 0 | 0 |
517 rows × 5 columns
mask = (
    df_Rome["has_availability"].isna() |
    (
        df_Rome["availability_30"].eq(0) &
        df_Rome["availability_60"].eq(0) &
        df_Rome["availability_90"].eq(0) &
        df_Rome["availability_365"].eq(0)
    )
)
count = mask.sum() # number of rows
rows = df_Rome.loc[mask] # the matching rows
count
rows[["has_availability","availability_30","availability_60","availability_90","availability_365"]]
| has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
|---|---|---|---|---|---|
| 14 | True | 0 | 0 | 0 | 0 |
| 33 | True | 0 | 0 | 0 | 0 |
| 43 | True | 0 | 0 | 0 | 0 |
| 82 | True | 0 | 0 | 0 | 0 |
| 103 | True | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 36103 | <NA> | 30 | 60 | 90 | 365 |
| 36155 | True | 0 | 0 | 0 | 0 |
| 36263 | <NA> | 19 | 49 | 76 | 351 |
| 36269 | <NA> | 29 | 59 | 89 | 364 |
| 36284 | <NA> | 25 | 55 | 85 | 175 |
2229 rows × 5 columns
# Clean df_Rome["has_availability"]
# Minimal (nur availability_365 nutzen):
# die 1,6% fehlenden Werte füllen: True, wenn mindestens 1 verfügbarer Tag
# in den nächsten 365 Tagen, sonst False
df_Rome["has_availability"] = (
df_Rome["has_availability"]
.fillna(df_Rome["availability_365"].gt(0))
.astype("boolean")
)
print("Missing \"has_availability\" after:", df_Rome["has_availability"].isna().sum())
Missing "has_availability" after: 0
missing_values(df_Rome)[missing["Total_perc"] > 0].Total_perc
host_since 0.057837 host_acceptance_rate 9.837781 host_verifications 0.057837 host_identity_verified 0.057837 neighbourhood_group_cleansed 51.372938 bathrooms 0.057837 bedrooms 2.139965 beds 10.261919 minimum_minimum_nights 0.027541 maximum_minimum_nights 0.027541 minimum_maximum_nights 0.027541 maximum_maximum_nights 0.027541 estimated_revenue_l365d 10.239885 first_review 13.181305 last_review 13.181305 review_scores_rating 13.181305 review_scores_accuracy 13.200584 review_scores_cleanliness 13.197830 review_scores_checkin 13.200584 review_scores_communication 13.197830 review_scores_location 13.203338 review_scores_value 13.200584 license 8.771930 reviews_per_month 13.181305 Name: Total_perc, dtype: float64
Spalten "estimated_revenue_l365d", "host_acceptance_rate" und "beds" untersuchen¶
Fehlende Werte > 10%
Ziel: Die Spalten estimated_revenue_l365d, host_acceptance_rate und beds prüfen und entfernen.
Vorgehen:
- estimated_revenue_l365d droppen.
- host_acceptance_rate droppen.
- beds droppen.
# Investigate: not in the dictionary
len(df_Rome["estimated_revenue_l365d"].unique())
5238
# Drop "estimated_revenue_l365d"
df_Rome = df_Rome.drop("estimated_revenue_l365d", axis=1, errors="ignore")
# Investigate
# mask = (
# df_Rome["beds"].eq(0) &
# df_Rome["bedrooms"].eq(0)
# )
# count = mask.sum()
# count
# rows = df_Rome.loc[mask]
# rows[["beds", "bedrooms"]]
# 60
# mask = (
# df_Rome["beds"].isnull() &
# df_Rome["bedrooms"].notnull()
# )
# mask = (
# df_Rome["beds"].notnull() &
# df_Rome["bedrooms"].notnull() &
# df_Rome["beds"] > df_Rome["bedrooms"]
# )
# rows = df_Rome.loc[mask]
# len(rows)*100/len(df) # 24 %
# Drop "beds": viele fehlende Werte
df_Rome = df_Rome.drop("beds", axis=1)
# Do not need for statistic: drop "host_acceptance_rate"
df_Rome = df_Rome.drop("host_acceptance_rate", axis=1)
Spalten "first_review" und "last_review" bereinigen und fehlende Werte ausfüllen¶
Fehlende Werte > 13%
Ziel: Die Datumsfelder first_review und last_review in datetime64[ns] konvertieren, Ausreißer (Zukunft/Extremwerte) entfernen und fehlende Werte mit dem globalen Median-Datum pro Spalte füllen.
Vorgehen:
- Datumsformat: in datetime64[ns] umwandeln.
- Offensichtliche Ausreißer:
  - Zukunftsdaten (> heute) -> NaT.
  - Sehr späte Werte (z. B. > 99,5%-Quantil) -> NaT.
  - Sehr frühe Werte (< 0,5%-Quantil) -> NaT.
- Konsistenz: sicherstellen, dass first_review ≤ last_review; sonst inkonsistente Werte auf NaT.
- Check vor Imputation: Verteilungen/Anteil Missing prüfen.
- Imputation: globaler Median je Spalte (first_review, last_review) zum Füllen von NaT.
- Check nach Imputation: Verteilungen/Kennzahlen erneut prüfen; Anteil imputierter Werte dokumentieren.
# Check missing values
date_cols = ["first_review", "last_review"]
for c in date_cols:
print("Missing values before imputation:", df_Rome[c].isna().sum())
Missing values before imputation: 4786 Missing values before imputation: 4786
# Datums-Extremwerte (einfach) -> NaT, dann mit Median füllen
date_cols = ["first_review", "last_review"]
for c in date_cols:
    s = pd.to_datetime(df_Rome[c], errors="coerce")
    # Offensichtliche Ausreißer -> NaT
    s = s.mask(s > pd.Timestamp.today())   # Zukunftsdaten
    upper_cap = s.quantile(0.995)          # sehr späte Ausreißer (oberes 99,5%-Quantil)
    s = s.mask(s > upper_cap)
    lower_cap = s.quantile(0.005)          # sehr frühe Ausreißer
    s = s.mask(s < lower_cap)
    df_Rome[c] = s                         # bereinigte Serie zurückschreiben
# Konsistenz prüfen: first_review <= last_review, sonst konservativ auf NaT setzen
both = df_Rome["first_review"].notna() & df_Rome["last_review"].notna()
bad = both & (df_Rome["first_review"] > df_Rome["last_review"])
if bad.any():
    df_Rome.loc[bad, ["first_review", "last_review"]] = pd.NaT
# Missing mit globalem Median-Datum füllen
for c in date_cols:
    med = df_Rome[c].median()              # Timestamp
    df_Rome[c] = df_Rome[c].fillna(med)    # fill NaT with median
# Check missing values
for c in date_cols:
print("Missing values after imputation:", df_Rome[c].isna().sum())
Missing values after imputation: 0 Missing values after imputation: 0
missing_values(df_Rome)[missing["Total_perc"] > 0].Total_perc
host_since 0.057837 host_verifications 0.057837 host_identity_verified 0.057837 neighbourhood_group_cleansed 51.372938 bathrooms 0.057837 bedrooms 2.139965 minimum_minimum_nights 0.027541 maximum_minimum_nights 0.027541 minimum_maximum_nights 0.027541 maximum_maximum_nights 0.027541 review_scores_rating 13.181305 review_scores_accuracy 13.200584 review_scores_cleanliness 13.197830 review_scores_checkin 13.200584 review_scores_communication 13.197830 review_scores_location 13.203338 review_scores_value 13.200584 license 8.771930 reviews_per_month 13.181305 Name: Total_perc, dtype: float64
Spalten "review_" und "reviews_" bereinigen und fehlende Werte ausfüllen¶
Fehlende Werte > 13%
Ziel: Alle numerischen Review-Spalten (z. B. review_scores_rating, review_scores_cleanliness, reviews_per_month) konsistent säubern, Ausreißer entfernen und fehlende Werte robust mit dem spaltenspezifischen Median füllen.
Auswahl der Spalten:
review_cols = [c for c in df_Rome.columns if c.lower().startswith(("review_", "reviews_"))]
Vorgehen:
- Typkonvertierung: ausgewählte Spalten in float konvertieren.
- Offensichtliche Fehler: negative Werte -> NaN.
- Ausreißer (hoch): Werte > 99,5%-Quantil pro Spalte -> NaN.
- Ausreißer (niedrig): Werte < 0,5%-Quantil -> NaN.
- Check vor Imputation: Verteilungen/Anteil Missing je Spalte prüfen.
- Imputation: fehlende Werte spaltenweise mit dem Median füllen.
- Check nach Imputation: Verteilungen und Anteil imputierter Werte dokumentieren; sicherstellen, dass keine negativen/außerhalb-Bereich-Werte verbleiben.
# Check missing values
review_cols = [c for c in df_Rome.columns if c.lower().startswith(("review_", "reviews_"))]
for c in review_cols:
print("Missing values before imputation:", df_Rome[c].isna().sum())
Missing values before imputation: 4786 Missing values before imputation: 4793 Missing values before imputation: 4792 Missing values before imputation: 4793 Missing values before imputation: 4792 Missing values before imputation: 4794 Missing values before imputation: 4793 Missing values before imputation: 4786
# Ausreißer -> NaN, dann mit Median füllen
for c in review_cols:
# zu float konvertieren
s = pd.to_numeric(df_Rome[c], errors="coerce")
# offensichtliche Fehler/Ausreißer -> NaN
s = s.mask(s < 0) # negative Werte nicht zulässig
upper_cap = s.quantile(0.995) # sehr hohe Spitzen kappen
s = s.mask(s > upper_cap)
# Sehr niedrige Ausreißer kappen
lower_cap = s.quantile(0.005)
s = s.mask(s < lower_cap)
# fehlende Werte mit Median füllen
df_Rome[c] = s.fillna(s.median())
# Check missing values
for c in review_cols:
    print("Missing values after imputation:", df_Rome[c].isna().sum())
Missing values before imputation: 4786 Missing values before imputation: 4793 Missing values before imputation: 4792 Missing values before imputation: 4793 Missing values before imputation: 4792 Missing values before imputation: 4794 Missing values before imputation: 4793 Missing values before imputation: 4786
Spalten mit fehlenden Werten bereinigen und fehlende Werte ausfüllen¶
- name: Leerzeichen trimmen; fehlend -> „Unbenannt“.
- host_since: Zukunft und > 99,5%-Quantil -> NaT; fehlend -> Median-Datum.
- host_verifications: fehlend -> "[]"; trimmen.
- host_identity_verified: fehlend -> False.
- bathrooms: gruppierter Median nach room_type × accommodates, sonst global; ≥ 0 clippen, auf 0,5 runden.
- bedrooms: gruppierter Median, sonst global; ≥ 0 clippen.
- license: fehlend -> „Unbenannt“.
- Report: verbleibende Missing je Spalte ausgeben.
# Spalten mit fehlenden Werten
missing_values(df_Rome)[missing["Total_perc"] > 0].Total_perc
host_since 0.057837 host_verifications 0.057837 host_identity_verified 0.057837 neighbourhood_group_cleansed 51.372938 bathrooms 0.057837 bedrooms 2.139965 minimum_minimum_nights 0.027541 maximum_minimum_nights 0.027541 minimum_maximum_nights 0.027541 maximum_maximum_nights 0.027541 review_scores_rating 13.181305 review_scores_accuracy 13.200584 review_scores_cleanliness 13.197830 review_scores_checkin 13.200584 review_scores_communication 13.197830 review_scores_location 13.203338 review_scores_value 13.200584 license 8.771930 reviews_per_month 13.181305 Name: Total_perc, dtype: float64
# Spalte: "name"
df_Rome["name"] = df_Rome["name"].str.strip().fillna("Unbenannt")
# Spalte: "license" (Missing-Rate ~8,8%)
df_Rome["license"] = df_Rome["license"].str.strip().fillna("Unbenannt")
# Spalte: "host_since"
s = df_Rome["host_since"]
s = s.mask(s > pd.Timestamp.today()) # Zukunft → NaT
upper_cap = s.quantile(0.995) # sehr späte Ausreißer
s = s.mask(s > upper_cap)
df_Rome["host_since"] = s.fillna(s.median()) # globaler Median pro Spalte
# Spalte: "host_verifications"
df_Rome["host_verifications"] = df_Rome["host_verifications"].fillna("[]").str.strip()
# Spalte: "host_identity_verified"
df_Rome["host_identity_verified"] = df_Rome["host_identity_verified"].fillna(False)
# Spalte: "bathrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_Rome.columns]
if grp_cols:
med_g = df_Rome.groupby(grp_cols)["bathrooms"].transform("median")
need = df_Rome["bathrooms"].isna()
df_Rome.loc[need, "bathrooms"] = med_g[need]
# Fallback global
df_Rome["bathrooms"] = df_Rome["bathrooms"].fillna(df_Rome["bathrooms"].median())
# Aufräumen
df_Rome["bathrooms"] = df_Rome["bathrooms"].clip(lower=0)
df_Rome["bathrooms"] = (np.round(df_Rome["bathrooms"] * 2) / 2) # 0.5-Schritte
# Spalte: "bedrooms"
grp_cols = [c for c in ["room_type","accommodates"] if c in df_Rome.columns]
if grp_cols:
med_g = df_Rome.groupby(grp_cols)["bedrooms"].transform("median")
need = df_Rome["bedrooms"].isna()
df_Rome.loc[need, "bedrooms"] = med_g[need]
df_Rome["bedrooms"] = df_Rome["bedrooms"].fillna(df_Rome["bedrooms"].median())
df_Rome["bedrooms"] = df_Rome["bedrooms"].clip(lower=0)
# Check
cols_report = ["name","host_since","host_verifications","host_identity_verified","bathrooms","bedrooms","license"]
present = [c for c in cols_report if c in df_Rome.columns]
print("Bereinigte Spalten:", present)
print(df_Rome[present].isna().sum().sort_values(ascending=False))
Bereinigte Spalten: ['name', 'host_since', 'host_verifications', 'host_identity_verified', 'bathrooms', 'bedrooms', 'license'] name 0 host_since 0 host_verifications 0 host_identity_verified 0 bathrooms 0 bedrooms 0 license 0 dtype: int64
missing_values(df_Rome)[missing["Total_perc"] > 0].Total_perc
neighbourhood_group_cleansed 51.372938 minimum_minimum_nights 0.027541 maximum_minimum_nights 0.027541 minimum_maximum_nights 0.027541 maximum_maximum_nights 0.027541 review_scores_rating 13.181305 review_scores_accuracy 13.200584 review_scores_cleanliness 13.197830 review_scores_checkin 13.200584 review_scores_communication 13.197830 review_scores_location 13.203338 review_scores_value 13.200584 reviews_per_month 13.181305 Name: Total_perc, dtype: float64
Feature Engineering¶
Host since year erstellen¶
Die Vorgehensweise ist hier wie folgt: Spalte "host_since_year" erstellen und Jahr aus "host_since" extrahieren
Ziel: Jahr für einfache Gruppierungen/Trends ableiten.
Voraussetzung: "host_since" ist bereits datetime64.
df_Rome["host_since"].head()
0 2008-09-18 1 2008-10-08 2 2009-10-09 3 2009-03-30 4 2010-01-30 Name: host_since, dtype: datetime64[ns]
# extract year from "host_since"
df_Rome["host_since_year"] = df_Rome["host_since"].dt.year.astype("int64")
df_Rome["host_since_year"].head()
0 2008 1 2008 2 2009 3 2009 4 2010 Name: host_since_year, dtype: int64
# Erstellen der Sicherung (save dataframe to csv file)
df_Rome.to_csv(f"{file_Rome[:-4]}_clean.csv")
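Hinweis: to_csv schreibt standardmäßig den Index mit; beim Wiedereinlesen entsteht daraus eine Spalte "Unnamed: 0". Mit index=False lässt sich das vermeiden. Ein Mini-Beispiel mit In-Memory-Puffer statt Datei:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
buf = io.StringIO()
df.to_csv(buf, index=False)   # Index nicht mitschreiben
buf.seek(0)
df_back = pd.read_csv(buf)    # keine "Unnamed: 0"-Spalte
```

Alternativ kann man beim Einlesen index_col=0 setzen, um den gespeicherten Index wiederzuverwenden.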
EDA-Plan (Airbnb New York)¶
Univariat (Verteilungen)¶
- Preis ("price")
- Verteilung (Histogramm): "price" und "log(price)"
- Kapazität & Ausstattung
- "accommodates", "bedrooms", "bathrooms" -> Histogramm und Boxplot
- Verfügbarkeit
- "availability_365"
- Bewertungen
- "number_of_reviews", "reviews_per_month", Scores "review_scores_*"
- Kategorien
- "room_type", "property_type", "neighbourhood_group_cleansed"
file_NY = "data/NY/listings_clean.csv"
df_NY = pd.read_csv(file_NY, low_memory=False)
df_NY.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36322 entries, 0 to 36321 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 36322 non-null int64 1 id 36322 non-null int64 2 name 36322 non-null object 3 host_id 36322 non-null int64 4 host_since 36322 non-null object 5 host_verifications 36322 non-null object 6 host_identity_verified 36322 non-null bool 7 neighbourhood_cleansed 36322 non-null object 8 neighbourhood_group_cleansed 36322 non-null object 9 latitude 36322 non-null float64 10 longitude 36322 non-null float64 11 property_type 36322 non-null object 12 room_type 36322 non-null object 13 accommodates 36322 non-null int64 14 bathrooms 36322 non-null float64 15 bedrooms 36322 non-null float64 16 amenities 36322 non-null object 17 price 36322 non-null float64 18 minimum_nights 36322 non-null int64 19 maximum_nights 36322 non-null int64 20 minimum_minimum_nights 36322 non-null int64 21 maximum_minimum_nights 36322 non-null int64 22 minimum_maximum_nights 36322 non-null int64 23 maximum_maximum_nights 36322 non-null int64 24 minimum_nights_avg_ntm 36322 non-null float64 25 maximum_nights_avg_ntm 36322 non-null float64 26 has_availability 36322 non-null bool 27 availability_30 36322 non-null int64 28 availability_60 36322 non-null int64 29 availability_90 36322 non-null int64 30 availability_365 36322 non-null int64 31 number_of_reviews 36322 non-null int64 32 number_of_reviews_ltm 36322 non-null int64 33 number_of_reviews_l30d 36322 non-null int64 34 availability_eoy 36322 non-null int64 35 number_of_reviews_ly 36322 non-null int64 36 estimated_occupancy_l365d 36322 non-null int64 37 first_review 36322 non-null object 38 last_review 36322 non-null object 39 review_scores_rating 36322 non-null float64 40 review_scores_accuracy 36322 non-null float64 41 review_scores_cleanliness 36322 non-null float64 42 review_scores_checkin 36322 non-null float64 43 review_scores_communication 36322 non-null float64 44 
review_scores_location 36322 non-null float64 45 review_scores_value 36322 non-null float64 46 license 36322 non-null object 47 instant_bookable 36322 non-null bool 48 reviews_per_month 36322 non-null float64 49 host_since_year 36322 non-null int64 dtypes: bool(3), float64(15), int64(21), object(11) memory usage: 13.1+ MB
# Unnamed: 0 als reine Index-Spalte aus EDA ausschließen
df_NY = df_NY.drop(columns="Unnamed: 0", axis=1, errors="ignore")
for c in ["host_since", "first_review", "last_review"]:
df_NY[c] = pd.to_datetime(df_NY[c], errors="coerce")
# Preis (raw & log1p)
fig, axes = plt.subplots(1, 2, figsize=(12,4))
sns.histplot(data=df_NY, x="price", bins=50, ax=axes[0])
axes[0].set_title("Preis (USD/Nacht)")
sns.histplot(data=df_NY, x=np.log1p(df_NY["price"]), bins=50, ax=axes[1])
axes[1].set_title("log(1+Preis)")
plt.show()
# Kapazität & Ausstattung
for col in ["accommodates","bedrooms","bathrooms"]:
if col in df_NY.columns:
fig, ax = plt.subplots(figsize=(6,4))
sns.histplot(data=df_NY, x=col, bins=40, ax=ax)
ax.set_title(f"Verteilung: {col}")
plt.show()
# Verfügbarkeit (Beispiel: 365)
fig, ax = plt.subplots(figsize=(6,4))
sns.histplot(data=df_NY, x="availability_365", bins=50, ax=ax)
ax.set_title("Verfügbarkeit (365 Tage)")
plt.tight_layout()
# Reviews: Anzahl / Rate / Score
for col in ["number_of_reviews", "reviews_per_month", "review_scores_rating"]:
fig, ax = plt.subplots(figsize=(6,4))
sns.histplot(data=df_NY, x=col, bins=50, ax=ax)
ax.set_title(f"Verteilung: {col}")
plt.tight_layout()
# Kategoriale Häufigkeiten
for col in ["room_type", "property_type", "neighbourhood_group_cleansed"]:
    fig, ax = plt.subplots(figsize=(7,4))
    order = df_NY[col].value_counts().index[:20]  # Top 20
    sns.countplot(data=df_NY, y=col, order=order, ax=ax)
    ax.set_title(f"Häufigkeiten: {col}")
    plt.show()
Bivariat: Korrelation und Gruppenvergleiche¶
- Numerisch vs. numerisch (Spearman bevorzugt)
  - "price" vs. "accommodates", "bathrooms", "bedrooms", "availability_365", "reviews_per_month", "estimated_occupancy_l365d"
  - Visuals: Scatter, Korrelations-Heatmap
- Kategorial vs. numerisch (Gruppenvergleiche)
  - "price" nach "room_type" & "neighbourhood_group_cleansed" -> Pivot, Boxplot
  - "reviews_per_month" nach "neighbourhood_group_cleansed"
- Kategorial vs. kategorial
  - "room_type × neighbourhood_group_cleansed" -> Kreuztabelle, Balkendiagramm, Heatmap
# Spearman-Korrelation (numeric)
num_cols = [c for c in df_NY.columns if pd.api.types.is_numeric_dtype(df_NY[c]) and c not in ["id"]]
corr_s = df_NY[num_cols].corr(method="spearman")
plt.figure(figsize=(10,8))
sns.heatmap(corr_s, cmap="coolwarm", vmin=-1, vmax=1, center=0)
plt.title("Spearman-Korrelation (numeric)")
plt.show()
# Scatter: price vs accommodates (Ausreißer > 2000 USD ausgeblendet, y log-skaliert)
mask = df_NY["price"] < 2000
plt.figure()
plt.scatter(df_NY.loc[mask, "accommodates"], np.log1p(df_NY.loc[mask, "price"]), alpha=0.3, s=8)
plt.title("price vs accommodates")
plt.xlabel("accommodates"); plt.ylabel("log(1+price)")
plt.show()
# Boxplots: price nach Room Type / Accommodates (Ausreißer > 2000 USD ausgeblendet)
mask = df_NY["price"] < 2000
# df.boxplot(by=...) erzeugt eine eigene Figur; figsize direkt übergeben,
# sonst bleibt eine zuvor per plt.figure() angelegte Figur leer zurück
df_NY[mask].boxplot(column="price", by="room_type", figsize=(6,4))
plt.title("Preis nach Room Type"); plt.suptitle("")
plt.xlabel("room_type"); plt.ylabel("price")
plt.show()
df_NY[mask].boxplot(column="price", by="accommodates", figsize=(6,4))
plt.title("Preis nach Accommodates"); plt.suptitle("")
plt.xlabel("Accommodates"); plt.ylabel("price")
plt.show()
# Pivot: Median-Preis Borough (administrative Einheit) × Room Type
pivot = df_NY.pivot_table(
    index="neighbourhood_group_cleansed",
    columns="room_type",
    values="price",
    aggfunc="median"
).round(2)
print("Median-Preis (USD/Nacht): Borough × Room Type")
display(pivot)
Median-Preis (USD/Nacht): Borough × Room Type
| room_type | Entire home/apt | Hotel room | Private room | Shared room |
|---|---|---|---|---|
| neighbourhood_group_cleansed | ||||
| Bronx | 125.5 | NaN | 68.0 | 37.0 |
| Brooklyn | 174.0 | 40000.0 | 90.0 | 40.0 |
| Manhattan | 235.0 | 20659.5 | 140.5 | 57.0 |
| Queens | 150.0 | 237.0 | 85.0 | 76.0 |
| Staten Island | 120.0 | NaN | 75.0 | 88.5 |
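Die extremen Hotel-Mediane (z. B. 40000 USD in Brooklyn) sollten zusammen mit der Fallzahl gelesen werden: Bei nur einer Handvoll Hotel-Listings ist der Median wenig belastbar. Eine mögliche Gegenprobe pivotiert Median und Anzahl gemeinsam — hier als Skizze mit einem hypothetischen Mini-DataFrame, nicht dem echten Datensatz:

```python
import pandas as pd

# Hypothetisches Mini-Beispiel: Median und Fallzahl gemeinsam ausweisen,
# damit erkennbar ist, auf wie vielen Listings ein Zellenwert beruht.
df = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Brooklyn", "Brooklyn", "Brooklyn", "Manhattan"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Hotel room", "Hotel room"],
    "price": [150.0, 200.0, 40000.0, 300.0],
})
pivot = df.pivot_table(
    index="neighbourhood_group_cleansed",
    columns="room_type",
    values="price",
    aggfunc=["median", "count"],  # zwei Aggregationen nebeneinander
)
print(pivot)
```

Im Projekt ließe sich dieselbe Pivot-Variante unverändert auf df_NY anwenden.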
# Ordnung nach Häufigkeit (optional, macht Plots übersichtlicher)
row_order = df_NY["room_type"].value_counts().index
col_order = df_NY["neighbourhood_group_cleansed"].value_counts().index
# Counts
ct = pd.crosstab(
    df_NY["room_type"], df_NY["neighbourhood_group_cleansed"]
).reindex(index=row_order, columns=col_order)
# Zeilenprozente (Anteile je room_type)
ct_row = pd.crosstab(
    df_NY["room_type"], df_NY["neighbourhood_group_cleansed"], normalize="index"
).reindex(index=row_order, columns=col_order) * 100
print("Counts:\n", ct, "\n")
print("Row %:\n", ct_row.round(1), "\n")
Counts:

| room_type | Manhattan | Brooklyn | Queens | Bronx | Staten Island |
|---|---|---|---|---|---|
| Entire home/apt | 10281 | 6543 | 1977 | 454 | 180 |
| Private room | 5421 | 6679 | 3326 | 737 | 183 |
| Hotel room | 307 | 19 | 9 | 0 | 0 |
| Shared room | 72 | 88 | 43 | 1 | 2 |

Row %:

| room_type | Manhattan | Brooklyn | Queens | Bronx | Staten Island |
|---|---|---|---|---|---|
| Entire home/apt | 52.9 | 33.7 | 10.2 | 2.3 | 0.9 |
| Private room | 33.2 | 40.9 | 20.3 | 4.5 | 1.1 |
| Hotel room | 91.6 | 5.7 | 2.7 | 0.0 | 0.0 |
| Shared room | 35.0 | 42.7 | 20.9 | 0.5 | 1.0 |
# Heatmap (Zeilenprozente)
plt.figure(figsize=(7,4))
sns.heatmap(ct_row, annot=True, fmt=".1f", cmap="Blues")
plt.title("room_type × Borough — Row %")
plt.xlabel("neighbourhood_group_cleansed"); plt.ylabel("room_type")
plt.tight_layout()
plt.show()
# Gestapeltes Balkendiagramm (Zeilenprozente)
ax = ct_row.plot(kind="bar", stacked=True, figsize=(7,4))
ax.set_ylabel("Anteil (%)"); ax.set_title("room_type × Borough — Row % (gestapelt)")
plt.legend(title="neighbourhood_group_cleansed", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
Multivariat¶
- Pairplot auf Variablen: "price", "accommodates", "bedrooms", "bathrooms", "reviews_per_month"
- Preis-Modell (deskriptiv, nicht prädiktiv)
  - OLS-Modell (statsmodels): "log(price) ~ accommodates + bedrooms + bathrooms + room_type + neighbourhood_group_cleansed + instant_bookable + review_scores_rating"
# Pairplot auf ausgewählten Variablen (Downsampling für Performance)
cols_pp = [c for c in ["price","accommodates","bedrooms","bathrooms","reviews_per_month"] if c in df_NY.columns]
sample_pp = df_NY[cols_pp + (["room_type"] if "room_type" in df_NY.columns else [])].sample(min(10000, len(df_NY)), random_state=42)
sns.pairplot(sample_pp, hue="room_type" if "room_type" in sample_pp.columns else None, diag_kind="hist")
plt.suptitle("Pairplot: ausgewählte Variablen", y=1.02)
# Multivariat: einfache Modelle (statsmodels)
# OLS: log(price) ~ Prädiktoren
df_NY_mod = df_NY.copy()
df_NY_mod = df_NY_mod[df_NY_mod["price"] > 0].copy()
df_NY_mod["log_price"] = np.log(df_NY_mod["price"])
# Formel
formula_ols = "log_price ~ accommodates + bedrooms + bathrooms + C(room_type) + C(neighbourhood_group_cleansed) + instant_bookable + review_scores_rating"
model_ols = smf.ols(formula=formula_ols, data=df_NY_mod).fit()
print(model_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: log_price R-squared: 0.654
Model: OLS Adj. R-squared: 0.654
Method: Least Squares F-statistic: 5710.
Date: Wed, 17 Sep 2025 Prob (F-statistic): 0.00
Time: 12:35:04 Log-Likelihood: -24862.
No. Observations: 36322 AIC: 4.975e+04
Df Residuals: 36309 BIC: 4.986e+04
Df Model: 12
Covariance Type: nonrobust
====================================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------------------------
Intercept 3.6210 0.043 85.168 0.000 3.538 3.704
C(room_type)[T.Hotel room] 2.7949 0.027 103.640 0.000 2.742 2.848
C(room_type)[T.Private room] -0.4671 0.006 -79.693 0.000 -0.479 -0.456
C(room_type)[T.Shared room] -0.8188 0.034 -24.240 0.000 -0.885 -0.753
C(neighbourhood_group_cleansed)[T.Brooklyn] 0.2477 0.015 17.034 0.000 0.219 0.276
C(neighbourhood_group_cleansed)[T.Manhattan] 0.6583 0.014 45.418 0.000 0.630 0.687
C(neighbourhood_group_cleansed)[T.Queens] 0.1077 0.015 7.003 0.000 0.078 0.138
C(neighbourhood_group_cleansed)[T.Staten Island] 0.0090 0.029 0.313 0.754 -0.047 0.065
instant_bookable[T.True] 0.2456 0.007 37.648 0.000 0.233 0.258
accommodates 0.1544 0.002 85.544 0.000 0.151 0.158
bedrooms -0.0196 0.004 -5.381 0.000 -0.027 -0.012
bathrooms 0.0726 0.006 12.646 0.000 0.061 0.084
review_scores_rating 0.1425 0.008 17.351 0.000 0.126 0.159
==============================================================================
Omnibus: 7981.767 Durbin-Watson: 1.683
Prob(Omnibus): 0.000 Jarque-Bera (JB): 91795.925
Skew: 0.729 Prob(JB): 0.00
Kurtosis: 10.650 Cond. No. 107.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
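Die Notiz [1] im Summary verweist auf die Annahme einer korrekt spezifizierten Fehlerkovarianz; da die Residuen laut Omnibus/Jarque-Bera deutlich von der Normalverteilung abweichen, liegt eine Schätzung mit heteroskedastie-robusten Standardfehlern (HC3) nahe. Skizze mit synthetischen Daten (nicht dem NY-Datensatz), die Formel ist auf zwei Prädiktoren verkürzt:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetische Demo-Daten mit bekannten Koeffizienten (0.15 bzw. -0.5)
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "accommodates": rng.integers(1, 8, n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], n),
})
demo["log_price"] = (
    3.5 + 0.15 * demo["accommodates"]
    - 0.5 * (demo["room_type"] == "Private room")
    + rng.normal(0, 0.3, n)
)
# Gleiche OLS-Schaetzung, aber mit HC3-robusten Standardfehlern
model_robust = smf.ols("log_price ~ accommodates + C(room_type)", data=demo).fit(cov_type="HC3")
print(model_robust.bse)  # robuste Standardfehler
```

Im Projekt genügt es, `.fit()` durch `.fit(cov_type="HC3")` zu ersetzen; die Punktschätzer bleiben identisch, nur die Standardfehler (und damit t-Werte/p-Werte) ändern sich.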
Zeitliche Charakteristika¶
- Host-Kohorten
  - "host_since_year" -> Entwicklung der Host-Neuzugänge über die Jahre
- Aktualität
  - "reviews_per_month" -> Aktivität nach Borough: Median reviews_per_month
# Hosts pro Jahr (Kohorten)
coh = df_NY.groupby("host_since_year")["host_id"].nunique().reset_index()
sns.barplot(data=coh, x="host_since_year", y="host_id")
plt.title("Neue Hosts pro Jahr")
plt.xlabel("Jahr")
plt.ylabel("Anzahl Hosts")
plt.xticks(rotation=45)
plt.show()
# Aktivität nach Borough: Median reviews_per_month
rpm = df_NY.groupby("neighbourhood_group_cleansed")["reviews_per_month"].median().reset_index()
plt.figure(figsize=(7,3.5))
sns.barplot(data=rpm, x="neighbourhood_group_cleansed", y="reviews_per_month")
plt.title("Median Reviews/Monat nach Borough")
plt.show()
Visualisierung der Listings (Plotly Express)¶
Geo-Visualisierung der Listings New York¶
Ziel: New York Airbnb-Listings sichtbar machen – nach Borough, Preis, Dichte und Objekttyp.
Datenbasis & Kartenstil
- Maskenbildung: einmalige Maske `mask` auf `price < 1500`.
- Verwendet wurde ein Sample des Datensatzes (`sample`) zur Performance.
- Kartenzentrum: Median von Latitude/Longitude; Stil: carto-positron.
- Hover-Infos: `name`, `neighbourhood_cleansed`, `room_type`, `price` (bzw. `property_type`, `availability_365`).

Visualisierungen
- Punktekarte – eingefärbt nach Borough (`neighbourhood_group_cleansed`)
  -> Zeigt die räumliche Verteilung der Listings über die Boroughs und lokale Häufungen.
- Punktekarte – eingefärbt nach Preis (kontinuierlich)
  -> Hebt Preis-Hotspots und günstigere Zonen hervor; erleichtert visuelle Outlier-Erkennung.
- Dichtekarte (Heatmap)
  -> Visualisiert Listing-Cluster unabhängig von der Farbe. Der Parameter `radius` steuert die Glättung (größer = glatter).
- Punktekarte – eingefärbt nach Objekttyp (`property_type`)
  -> Zeigt den räumlichen Mix der Unterkunftsarten und mögliche Typ-Cluster.
Interpretationshinweise
- Bei Punktkarten können sich Marker überlagern; die Dichtekarte ergänzt hier den Überblick.
- Farbskalen unterstützen das Erkennen von Hotspots und Strukturunterschieden zwischen Stadtteilen.
- Für belastbare Schlüsse mit Aggregaten (z. B. Median-Preis je Borough) gegenprüfen.
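Der letzte Punkt (Gegenprüfen mit Aggregaten) lässt sich direkt umsetzen; hier eine Skizze mit einem hypothetischen Mini-DataFrame anstelle von df_NY:

```python
import pandas as pd

# Gegenprobe zu den Karten: Median-Preis je Borough als Tabelle,
# bevor man aus Punktfarben auf Preisniveaus schließt.
# (Hypothetische Beispielwerte, nicht der echte Datensatz.)
demo = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Manhattan", "Manhattan", "Brooklyn", "Bronx"],
    "price": [235.0, 180.0, 120.0, 80.0],
})
median_by_borough = (
    demo.groupby("neighbourhood_group_cleansed")["price"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_borough)
```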
# Map center from data
center = dict(lat=df_NY["latitude"].median(), lon=df_NY["longitude"].median())
mask = df_NY["price"] < 1500
sample = df_NY[mask]
# Downsample for performance (adjust N as you like)
# sample = df_NY.sample(min(80000, len(df_NY)), random_state=42)
# Points map, colored by borough
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="neighbourhood_group_cleansed",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
# px.scatter_map (MapLibre) nutzt layout.map, nicht layout.mapbox
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings by Borough")
fig.show()
# Points map, colored by price (continuous)
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="price",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Price")
fig.show()
# Density map (where listings cluster)
fig = px.density_map(
    sample,
    lat="latitude", lon="longitude",
    z=None,    # or "price" to weight by price
    radius=8,  # increase for smoother heat
    center=center, zoom=9, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listing Density (Heat)")
fig.show()
# Points map, colored by property_type
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="property_type",
    hover_data=["name","neighbourhood_cleansed","room_type","property_type"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Property type")
fig.show()
# Points map, colored by availability_365
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="availability_365",
    hover_data=["name","neighbourhood_cleansed","room_type","availability_365"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by availability_365")
fig.show()
Dash mit Diagramm und Schieberegler: Airbnb NYC – Jahresfilter: Verfügbarkeit vs Preis¶
Ziel: Interaktive Punktekarte (availability_365 vs. price) mit Jahres-Slider.
Farbe = Borough (neighbourhood_group_cleansed), Punktgröße = number_of_reviews.
Schritte
- Slider-Jahre: aus `host_since_year` ermittelt.
- Callback: Filter als eine kombinierte Bedingung.
- Plot: `px.scatter` mit `template="plotly_white"`, Farbe = Borough, Größe = Reviews, Hover zeigt `room_type`, `bedrooms`, `bathrooms`.

HTML-Export & Öffnen
- Speichern als HTML (self-contained)
- Link im Standard-Browser öffnen
- Link im Notebook einbetten
Interpretation
- Steigung/Form: Beziehung zwischen Verfügbarkeit (365 Tage) und Preis.
- Farbcluster: Räumliche Segmente (Boroughs).
- Punktgröße: Signalisiert Popularität/Aktivität über Reviews.
out_file_NY = "airbnb_price_availability_yearslider_NY.html"
# slider years
years = sorted([int(y) for y in df_NY["host_since_year"].unique()])
initial_year = years[0]
app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id="graph-with-slider"),
    dcc.Slider(
        min=years[0], max=years[-1], step=None, value=initial_year,
        marks={str(y): str(y) for y in years},
        id="year-slider"
    )
])
@app.callback(
    Output("graph-with-slider", "figure"),
    Input("year-slider", "value")
)
def update_figure(selected_year):
    # Jahresfilter mit der Preis-Maske (price < 1500) auf df_NY kombinieren
    cond = (df_NY["host_since_year"] == selected_year) & mask
    df_NYf = df_NY.loc[cond]
    if df_NYf.empty:
        # leere Figur zurückgeben, statt auf eine noch nicht definierte fig zuzugreifen
        return px.scatter(title=f"Keine Listings – {selected_year}", template="plotly_white")
    # Scatter: x=availability_365, y=price, color=Borough, size=number_of_reviews
    fig = px.scatter(
        df_NYf,
        x="availability_365", y="price",
        color="neighbourhood_group_cleansed",
        size="number_of_reviews",
        hover_data=["room_type", "bedrooms", "bathrooms"],
        size_max=18,
        title=f"Listings – {selected_year}",
        template="plotly_white"
    )
    # Save as HTML
    fig.write_html(out_file_NY, include_plotlyjs="cdn", full_html=True)
    fig.update_layout(transition_duration=500, legend_title_text="Borough")
    return fig
if __name__ == "__main__":
    app.run(debug=True, port=8050)
# Embed the link inside the notebook
# IFrame(src=out_file_NY, width="100%", height=600)
# Open directly in the default browser
# webbrowser.open("file://" + os.path.abspath(out_file_NY))
Dashboard: Tabelle und Histogramm mit Radio-Auswahl. Komponenten nebeneinander zeigen.¶
Ziel: Ein Dash-Layout, das eine Daten-Tabelle und ein Histogramm anzeigt. Per Radio-Buttons wählst du die Kennzahl, die im Plot aggregiert dargestellt wird.
Aufbau
- Styling: Einbindung eines externen CSS (CodePen) für das Grid-Layout.
- Kontrollelemente: `dcc.RadioItems` zur Auswahl der Metrik (Optionen: `price`, `review_scores_rating`, `availability_365`).
- Inhalt in zwei Spalten:
  - Links: Dash-Tabelle mit den Datensätzen.
  - Rechts: Dash-Graph für das Histogramm.

Interaktion (Callback)
- Input: Wert der Radio-Buttons.
- Output: Aktualisiert die Figur.
- Logik: `px.histogram` mit `x='neighbourhood_group_cleansed'`, `y=col_chosen`, `histfunc='avg'` -> zeigt durchschnittliche Werte je Borough.
Start
- Serverstart über `app.run(debug=True, port=8051)` und Aufruf im Browser (lokale URL).
# Initialize the app - incorporate css
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(external_stylesheets=external_stylesheets)
# App layout
app.layout = [
    html.Div(className='row',
             children='Tabelle, Kontrollelemente und Histogramm mit durchschnittlichen Werten je Borough',
             style={'textAlign': 'center', 'color': 'blue', 'fontSize': 30}),
    html.Div(className='row', children=[
        dcc.RadioItems(options=['price', 'review_scores_rating', 'availability_365'],
                       value='review_scores_rating',
                       inline=True,
                       id='my-radio-buttons-final')
    ]),
    html.Div(className='row', children=[
        html.Div(className='six columns', children=[
            dash_table.DataTable(data=df_NY.to_dict('records'), page_size=11, style_table={'overflowX': 'auto'})
        ]),
        html.Div(className='six columns', children=[
            dcc.Graph(figure={}, id='histo-chart-final')
        ])
    ])
]
# Add controls to build the interaction
@callback(
    Output(component_id='histo-chart-final', component_property='figure'),
    Input(component_id='my-radio-buttons-final', component_property='value')
)
def update_graph(col_chosen):
    fig = px.histogram(df_NY, x='neighbourhood_group_cleansed', y=col_chosen, histfunc='avg')
    return fig
# Run the app
if __name__ == '__main__':
    app.run(debug=True, port=8051)
Geo-Visualisierung der Listings Rome¶
Ziel: Rome Airbnb-Listings sichtbar machen – nach Borough, Preis, Dichte und Objekttyp.
Datenbasis & Kartenstil
- Maskenbildung: einmalige Maske `mask` auf `price < 1500`.
- Verwendet wurde ein Sample des Datensatzes (`sample`) zur Performance.
- Kartenzentrum: Median von Latitude/Longitude; Stil: carto-positron.
- Hover-Infos: `name`, `neighbourhood_cleansed`, `room_type`, `price` (bzw. `property_type`, `availability_365`).

Visualisierungen
- Punktekarte – eingefärbt nach Borough (`neighbourhood_group_cleansed`)
  -> Zeigt die räumliche Verteilung der Listings über die Boroughs und lokale Häufungen.
- Punktekarte – eingefärbt nach Preis (kontinuierlich)
  -> Hebt Preis-Hotspots und günstigere Zonen hervor; erleichtert visuelle Outlier-Erkennung.
- Dichtekarte (Heatmap)
  -> Visualisiert Listing-Cluster unabhängig von der Farbe. Der Parameter `radius` steuert die Glättung (größer = glatter).
- Punktekarte – eingefärbt nach Objekttyp (`property_type`)
  -> Zeigt den räumlichen Mix der Unterkunftsarten und mögliche Typ-Cluster.
Interpretationshinweise
- Bei Punktkarten können sich Marker überlagern; die Dichtekarte ergänzt hier den Überblick.
- Farbskalen unterstützen das Erkennen von Hotspots und Strukturunterschieden zwischen Stadtteilen.
- Für belastbare Schlüsse mit Aggregaten (z. B. Median-Preis je Borough) gegenprüfen.
file_Rome = "data/Rome/listings_clean.csv"
df_Rome = pd.read_csv(file_Rome, low_memory=False)
df_Rome = df_Rome.drop(columns=["Unnamed: 0"], errors="ignore")
df_Rome.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36309 entries, 0 to 36308 Data columns (total 49 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 36309 non-null int64 1 name 36309 non-null object 2 host_id 36309 non-null int64 3 host_since 36309 non-null object 4 host_verifications 36309 non-null object 5 host_identity_verified 36309 non-null bool 6 neighbourhood_cleansed 36309 non-null object 7 neighbourhood_group_cleansed 17656 non-null object 8 latitude 36309 non-null float64 9 longitude 36309 non-null float64 10 property_type 36309 non-null object 11 room_type 36309 non-null object 12 accommodates 36309 non-null int64 13 bathrooms 36309 non-null float64 14 bedrooms 36309 non-null float64 15 amenities 36309 non-null object 16 price 36309 non-null float64 17 minimum_nights 36309 non-null int64 18 maximum_nights 36309 non-null int64 19 minimum_minimum_nights 36299 non-null float64 20 maximum_minimum_nights 36299 non-null float64 21 minimum_maximum_nights 36299 non-null float64 22 maximum_maximum_nights 36299 non-null float64 23 minimum_nights_avg_ntm 36309 non-null float64 24 maximum_nights_avg_ntm 36309 non-null float64 25 has_availability 36309 non-null bool 26 availability_30 36309 non-null int64 27 availability_60 36309 non-null int64 28 availability_90 36309 non-null int64 29 availability_365 36309 non-null int64 30 number_of_reviews 36309 non-null int64 31 number_of_reviews_ltm 36309 non-null int64 32 number_of_reviews_l30d 36309 non-null int64 33 availability_eoy 36309 non-null int64 34 number_of_reviews_ly 36309 non-null int64 35 estimated_occupancy_l365d 36309 non-null int64 36 first_review 36309 non-null object 37 last_review 36309 non-null object 38 review_scores_rating 31523 non-null float64 39 review_scores_accuracy 31516 non-null float64 40 review_scores_cleanliness 31517 non-null float64 41 review_scores_checkin 31516 non-null float64 42 review_scores_communication 31517 non-null float64 43 review_scores_location 31515 
non-null float64 44 review_scores_value 31516 non-null float64 45 license 36309 non-null object 46 instant_bookable 36309 non-null bool 47 reviews_per_month 31523 non-null float64 48 host_since_year 36309 non-null int64 dtypes: bool(3), float64(19), int64(16), object(11) memory usage: 12.8+ MB
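Auffällig im Rome-Datensatz: `neighbourhood_group_cleansed` ist nur für 17656 von 36309 Zeilen gefüllt. Vor dem Einfärben der Karte nach dieser Spalte sollten fehlende Werte explizit benannt werden, sonst fehlen die betroffenen Listings stillschweigend in der Legende. Minimalskizze mit hypothetischen Beispielwerten:

```python
import numpy as np
import pandas as pd

# Mini-Beispiel (nicht der echte Datensatz): NaN-Boroughs vor dem Plotten
# als eigene Kategorie "Unbekannt" kennzeichnen.
demo = pd.DataFrame({"neighbourhood_group_cleansed": ["I Centro Storico", np.nan, np.nan]})
demo["neighbourhood_group_cleansed"] = demo["neighbourhood_group_cleansed"].fillna("Unbekannt")
print(demo["neighbourhood_group_cleansed"].value_counts())
```

Dieselbe Zeile mit `fillna("Unbekannt")` ließe sich vor den Karten-Zellen auf df_Rome anwenden.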
# Map center from data
center = dict(lat=df_Rome["latitude"].median(), lon=df_Rome["longitude"].median())
mask = df_Rome["price"] < 1500
sample = df_Rome[mask]
# Downsample for performance (adjust N as you like)
# sample = df_Rome.sample(min(80000, len(df_Rome)), random_state=42)
# Points map, colored by borough
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="neighbourhood_group_cleansed",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
# px.scatter_map (MapLibre) nutzt layout.map, nicht layout.mapbox
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings by Borough")
fig.show()
# Points map, colored by price (continuous)
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="price",
    hover_data=["name","neighbourhood_cleansed","room_type","price"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Price")
fig.show()
# Density map (where listings cluster)
fig = px.density_map(
    sample,
    lat="latitude", lon="longitude",
    z=None,    # or "price" to weight by price
    radius=8,  # increase for smoother heat
    center=center, zoom=9, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listing Density (Heat)")
fig.show()
# Points map, colored by property_type
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="property_type",
    hover_data=["name","neighbourhood_cleansed","room_type","property_type"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by Property type")
fig.show()
# Points map, colored by availability_365
fig = px.scatter_map(
    sample,
    lat="latitude", lon="longitude",
    color="availability_365",
    hover_data=["name","neighbourhood_cleansed","room_type","availability_365"],
    zoom=9, center=center, height=600,
)
fig.update_layout(map_style="carto-positron", margin=dict(l=0,r=0,t=40,b=0), title="Listings colored by availability_365")
fig.show()
Dash mit Diagramm und Schieberegler: Airbnb Rome – Jahresfilter: Verfügbarkeit & Preis¶
Ziel: Interaktive Punktekarte (availability_365 vs. price) mit Jahres-Slider.
Farbe = Borough (neighbourhood_group_cleansed), Punktgröße = number_of_reviews.
Schritte
- Slider-Jahre: aus `host_since_year` ermittelt.
- Callback: Filter als eine kombinierte Bedingung.
- Plot: `px.scatter` mit `template="plotly_white"`, Farbe = Borough, Größe = Reviews, Hover zeigt `room_type`, `bedrooms`, `bathrooms`.

HTML-Export & Öffnen
- Speichern als HTML (self-contained)
- Link im Standard-Browser öffnen
- Link im Notebook einbetten
Interpretation
- Steigung/Form: Beziehung zwischen Verfügbarkeit (365 Tage) und Preis.
- Farbcluster: Räumliche Segmente (Boroughs).
- Punktgröße: Signalisiert Popularität/Aktivität über Reviews.
out_file_Rome = "airbnb_price_availability_yearslider_Rome.html"
# slider years
years = sorted([int(y) for y in df_Rome["host_since_year"].unique()])
initial_year = years[0]
app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id="graph-with-slider"),
    dcc.Slider(
        min=years[0], max=years[-1], step=None, value=initial_year,
        marks={str(y): str(y) for y in years},
        id="year-slider"
    )
])
@app.callback(
    Output("graph-with-slider", "figure"),
    Input("year-slider", "value")
)
def update_figure(selected_year):
    # Jahresfilter mit der Preis-Maske (price < 1500) auf df_Rome kombinieren
    cond = (df_Rome["host_since_year"] == selected_year) & mask
    df_Romef = df_Rome.loc[cond]
    if df_Romef.empty:
        # leere Figur zurückgeben, statt auf eine noch nicht definierte fig zuzugreifen
        return px.scatter(title=f"Keine Listings – {selected_year}", template="plotly_white")
    # Scatter: x=availability_365, y=price, color=Borough, size=number_of_reviews
    fig = px.scatter(
        df_Romef,
        x="availability_365", y="price",
        color="neighbourhood_group_cleansed",
        size="number_of_reviews",
        hover_data=["room_type", "bedrooms", "bathrooms"],
        size_max=18,
        title=f"Listings – {selected_year}",
        template="plotly_white"
    )
    # Save as HTML
    fig.write_html(out_file_Rome, include_plotlyjs="cdn", full_html=True)
    fig.update_layout(transition_duration=500, legend_title_text="Borough")
    return fig
if __name__ == "__main__":
    app.run(debug=True, port=8053)
# Embed the link inside the notebook
# IFrame(src=out_file_Rome, width="100%", height=600)
# Open directly in the default browser
# webbrowser.open("file://" + os.path.abspath(out_file_Rome))
Dashboard: Tabelle und Histogramm mit Radio-Auswahl. Komponenten nebeneinander zeigen.¶
Ziel: Ein Dash-Layout, das eine Daten-Tabelle und ein Histogramm anzeigt. Per Radio-Buttons wählst du die Kennzahl, die im Plot aggregiert dargestellt wird.
Aufbau
- Styling: Einbindung eines externen CSS (CodePen) für das Grid-Layout.
- Kontrollelemente: `dcc.RadioItems` zur Auswahl der Metrik (Optionen: `price`, `review_scores_rating`, `availability_365`).
- Inhalt in zwei Spalten:
  - Links: Dash-Tabelle mit den Datensätzen.
  - Rechts: Dash-Graph für das Histogramm.

Interaktion (Callback)
- Input: Wert der Radio-Buttons.
- Output: Aktualisiert die Figur.
- Logik: `px.histogram` mit `x='neighbourhood_group_cleansed'`, `y=col_chosen`, `histfunc='avg'` -> zeigt durchschnittliche Werte je Borough.
Start
- Serverstart über `app.run(debug=True, port=8054)` und Aufruf im Browser (lokale URL).
# Initialize the app - incorporate css
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = Dash(external_stylesheets=external_stylesheets)
# App layout
app.layout = [
    html.Div(className='row',
             children='Tabelle, Kontrollelemente und Histogramm mit durchschnittlichen Werten je Borough',
             style={'textAlign': 'center', 'color': 'blue', 'fontSize': 30}),
    html.Div(className='row', children=[
        dcc.RadioItems(options=['price', 'review_scores_rating', 'availability_365'],
                       value='review_scores_rating',
                       inline=True,
                       id='my-radio-buttons-final')
    ]),
    html.Div(className='row', children=[
        html.Div(className='six columns', children=[
            dash_table.DataTable(data=df_Rome.to_dict('records'), page_size=11, style_table={'overflowX': 'auto'})
        ]),
        html.Div(className='six columns', children=[
            dcc.Graph(figure={}, id='histo-chart-final')
        ])
    ])
]
# Add controls to build the interaction
@callback(
    Output(component_id='histo-chart-final', component_property='figure'),
    Input(component_id='my-radio-buttons-final', component_property='value')
)
def update_graph(col_chosen):
    fig = px.histogram(df_Rome, x='neighbourhood_group_cleansed', y=col_chosen, histfunc='avg')
    return fig
# Run the app
if __name__ == '__main__':
    app.run(debug=True, port=8054)
Donut-Kreischarts: Lizenzstatus New York vs. Rome¶
Ziel: Prozentuale Verteilung der Lizenzangaben je Stadt als Donut.
Gruppenbildung
- New York (3 Sektoren): Registriert (OSE) = Werte beginnend mit `OSE-` · Exempt = genau `Exempt` · Unbenannt = leer/NaN/Sonstiges.
- Rom (4 Sektoren): CIN vorhanden = Werte beginnend mit `IT` · CIN beantragt = Text enthält „applied for a CIN“ · Exempt = genau `Exempt` · Unbenannt = leer/NaN/Sonstiges.

Darstellung: Zwei Donuts nebeneinander (Plotly), Text = Prozent + Label, stadtspezifische Farben. Kategorien ohne Vorkommen werden automatisch weggelassen.
# Spalte "license" – Werte: 'Unbenannt', 'Exempt', "OSE-*" (vgl. Ausgabe unten)
print(df_NY["license"].unique()[:5])
[i for i in df_NY["license"].unique() if str(i)[:4].upper() != "OSE-"]
['Unbenannt' 'OSE-STRREG-0000008' 'OSE-STRREG-0000923' 'OSE-STRREG-0000656' 'OSE-STRREG-0000108']
['Unbenannt', 'Exempt']
# Spalte "license" – Werte: 'Unbenannt', 'Exempt', 'I applied for a CIN, but haven’t received it yet', "IT*"
print(df_Rome["license"].unique()[:5])
[i for i in df_Rome["license"].unique() if str(i)[:2].upper() != "IT"]
['Unbenannt' 'IT058091C29VJSIZQZ' 'it058091c2kv6epw8f' 'IT058091C20YD35BX2' 'IT058091B4I3RFDVLG']
['Unbenannt', 'I applied for a CIN, but haven’t received it yet', 'Exempt']
# Donut-Kreischarts für Lizenz-Gruppen: New York (3 Sektoren) & Rome (4 Sektoren)
def group_license_ny(val):
    if pd.isna(val):
        return "Unbenannt"
    s = str(val).strip()
    up = s.upper()
    if up.startswith("OSE-"):  # z. B. OSE-STRREG-...
        return "Registriert (OSE)"
    if up == "EXEMPT":
        return "Exempt"
    if s == "" or up in {"UNBENANNT", "NAN"}:
        return "Unbenannt"
    # Alles, was nicht OSE-/Exempt ist, zu "Unbenannt" schieben (nur 3 Gruppen gewünscht)
    return "Unbenannt"
def group_license_rome(val):
    if pd.isna(val):
        return "Unbenannt"
    s = str(val).strip()
    up = s.upper()
    if up.startswith("IT"):  # z. B. IT058091...
        return "CIN vorhanden"
    if "APPLIED FOR A CIN" in up:  # "I applied for a CIN, but haven’t received it yet"
        return "CIN beantragt"
    if up == "EXEMPT":
        return "Exempt"
    if s == "" or up in {"UNBENANNT", "NAN"}:
        return "Unbenannt"
    # Rest ebenfalls zu "Unbenannt" (nur 4 Gruppen gewünscht)
    return "Unbenannt"
# Häufigkeiten berechnen
ny_groups = df_NY["license"].apply(group_license_ny).value_counts()
rome_groups = df_Rome["license"].apply(group_license_rome).value_counts()
# Feste Reihenfolge (optional)
ny_order = ["Registriert (OSE)", "Exempt", "Unbenannt"]
rome_order = ["CIN vorhanden", "CIN beantragt", "Exempt", "Unbenannt"]
ny_labels = [g for g in ny_order if g in ny_groups.index]
ny_values = [int(ny_groups[g]) for g in ny_labels]
rome_labels = [g for g in rome_order if g in rome_groups.index]
rome_values = [int(rome_groups[g]) for g in rome_labels]
# Farben (3 für NY, 4 für Rom)
ny_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"] # Blau/Orange/Grün
rome_colors = ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"] # Blau/Rot/Grün/Lila
# Donut-Subplots zeichnen
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "domain"}, {"type": "domain"}]],
    subplot_titles=("New York", "Rome")
)
fig.add_trace(
    go.Pie(
        labels=ny_labels, values=ny_values,
        hole=0.5, textinfo="percent+label",
        marker=dict(colors=ny_colors[:len(ny_labels)])
    ),
    1, 1
)
fig.add_trace(
    go.Pie(
        labels=rome_labels, values=rome_values,
        hole=0.5, textinfo="percent+label",
        marker=dict(colors=rome_colors[:len(rome_labels)])
    ),
    1, 2
)
fig.update_layout(
    title_text="Lizenzstatus: Verteilung von Lizenzgruppen",
    margin=dict(l=20, r=20, t=60, b=20),
    showlegend=False,
    uniformtext_minsize=10,
    uniformtext_mode="hide"
)
fig.show()
Natural Language Processing mit der spaCy-Bibliothek¶
Ausstattung (amenities): Airbnb New York vs. Airbnb Rome
Das Standardmodell ist "en_core_web_sm" (Englisch). Es wird mit dem folgenden Befehl in der Konsole heruntergeladen:
# !python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
df_NY["amenities"].values[:2]
array(['["Dedicated workspace", "Essentials", "Cooking basics", "Stove", "Blender", "Dishes and silverware", "Hot water", "Dining table", "Dryer", "Washer", "Lock on bedroom door", "Kitchen", "Bathtub", "Oven", "Refrigerator", "Wine glasses", "Single level home", "Fire extinguisher", "Fast wifi \\u2013 330 Mbps", "Luggage dropoff allowed", "Microwave", "Bed linens", "Cleaning available during stay", "Conditioner", "Smoke alarm", "Elevator", "Body soap", "Clothing storage", "Self check-in", "Hair dryer", "Backyard", "Carbon monoxide alarm", "Freezer", "Heating", "Shower gel", "Coffee maker", "Paid parking off premises", "Extra pillows and blankets", "Cleaning products", "Long term stays allowed", "Free street parking", "Shampoo", "Laundromat nearby", "Lockbox"]',
'["Dedicated workspace", "Essentials", "Cooking basics", "Stove", "Dishes and silverware", "TV", "Hot water", "Kitchen", "Keypad", "Bathtub", "Hangers", "Refrigerator", "Oven", "Fire extinguisher", "Luggage dropoff allowed", "Bed linens", "Cleaning available during stay", "Baking sheet", "Smoke alarm", "Self check-in", "Hair dryer", "Wifi", "Iron", "Carbon monoxide alarm", "Ethernet connection", "Heating", "Air conditioning", "Coffee maker", "Paid parking off premises", "Extra pillows and blankets", "Long term stays allowed", "Free street parking"]'],
dtype=object)
df_Rome["amenities"].values[:2]
array(['["Cleaning products", "Dryer", "Bed linens", "Outdoor dining area", "Hot tub", "Dining table", "Luggage dropoff allowed", "Oven", "Wifi", "Free street parking", "Microwave", "Free parking on premises", "Laundromat nearby", "Drying rack for clothing", "Kitchen", "Heating", "Freezer", "Coffee maker", "Patio or balcony", "Washer", "Hot water kettle", "Dishwasher", "Hangers", "Toaster", "Refrigerator", "Long term stays allowed", "Hot water", "Wine glasses", "Portable fans", "Dedicated workspace", "Cooking basics", "Hair dryer", "High chair", "Dishes and silverware", "First aid kit", "Elevator", "Bidet", "Blender", "Housekeeping - available at extra cost", "Air conditioning", "Stove", "Iron"]',
'["Dryer", "Bed linens", "Luggage dropoff allowed", "Extra pillows and blankets", "Wifi", "Shampoo", "Kitchen", "Heating", "Coffee maker", "Hangers", "Ethernet connection", "Refrigerator", "Long term stays allowed", "Hot water", "Dedicated workspace", "Hair dryer", "Paid parking off premises", "Essentials", "Dishes and silverware", "Pocket wifi", "Elevator", "Window AC unit", "HDTV with standard cable", "Host greets you", "Free washer \\u2013 In unit", "Iron"]'],
dtype=object)
df_NY["amenities"].values[5].replace("[", "").replace("]", "")
df_Rome["amenities"].values[5].replace("[", "").replace("]", "")
'"Cleaning products", "Bed linens", "Dining table", "Extra pillows and blankets", "Shower gel", "Room-darkening shades", "Central air conditioning", "City skyline view", "Free street parking", "Smoking allowed", "Cleaning available during stay", "Microwave", "Crib - available upon request", "Books and reading material", "Shampoo", "Drying rack for clothing", "Laundromat nearby", "Kitchen", "Freezer", "Coffee maker", "Hot water kettle", "Paid parking garage on premises", "Clothing storage: closet and wardrobe", "Hangers", "Toaster", "Central heating", "Single level home", "Refrigerator", "Long term stays allowed", "Gas stove", "Hot water", "Wine glasses", "Dedicated workspace", "Cooking basics", "Hair dryer", "Paid parking off premises", "Essentials", "Coffee", "Pets allowed", "Fast wifi \\u2013 54 Mbps", "Dishes and silverware", "Fire extinguisher", "24 inch HDTV", "Elevator", "Ceiling fan", "Host greets you", "Body soap", "Blender", "Free washer \\u2013 In unit", "Iron"'
len(df_NY)//4, len(df_Rome)//4
(9080, 9077)
text_NY = "".join([i.replace("[", "").replace("]", "") for i in df_NY["amenities"].values[:10000]])
len(text_NY) # 3328489
# Error: Text of length 3328489 exceeds maximum of 1000000
3328489
text_Rome = "".join([i.replace("[", "").replace("]", "") for i in df_Rome["amenities"].values[:10000]])
len(text_Rome) # 5622364
# Error: Text of length 5622364 exceeds maximum of 1000000
5622364
doc_NY = nlp(text_NY[:800000])
doc_Rome = nlp(text_Rome[:800000])
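The errors above come from spaCy's default character limit of 1,000,000 (`nlp.max_length`). Instead of truncating at 800,000 characters, the limit can be raised (e.g. `nlp.max_length = 6_000_000`) or the text processed in chunks. A minimal, pure-Python chunking sketch; the separator `'", "'` is an assumption based on the amenity string format:

```python
def chunk_text(text, max_len=800_000, sep='", "'):
    """Split text into pieces of at most max_len characters, preferring
    to cut at the amenity separator so entries are not torn apart."""
    chunks = []
    while len(text) > max_len:
        cut = text.rfind(sep, 0, max_len)
        if cut <= 0:            # no usable separator found: hard cut as fallback
            cut = max_len
        chunks.append(text[:cut])
        text = text[cut:]
    chunks.append(text)
    return chunks

# Usage (hypothetical): docs = [nlp(c) for c in chunk_text(text_NY)]
```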
print([token.text for token in doc_NY][:100])
['"', 'Dedicated', 'workspace', '"', ',', '"', 'Essentials', '"', ',', '"', 'Cooking', 'basics', '"', ',', '"', 'Stove', '"', ',', '"', 'Blender', '"', ',', '"', 'Dishes', 'and', 'silverware', '"', ',', '"', 'Hot', 'water', '"', ',', '"', 'Dining', 'table', '"', ',', '"', 'Dryer', '"', ',', '"', 'Washer', '"', ',', '"', 'Lock', 'on', 'bedroom', 'door', '"', ',', '"', 'Kitchen', '"', ',', '"', 'Bathtub', '"', ',', '"', 'Oven', '"', ',', '"', 'Refrigerator', '"', ',', '"', 'Wine', 'glasses', '"', ',', '"', 'Single', 'level', 'home', '"', ',', '"', 'Fire', 'extinguisher', '"', ',', '"', 'Fast', 'wifi', '\\u2013', '330', 'Mbps', '"', ',', '"', 'Luggage', 'dropoff', 'allowed', '"', ',', '"']
print([token.text for token in doc_Rome][:100])
['"', 'Cleaning', 'products', '"', ',', '"', 'Dryer', '"', ',', '"', 'Bed', 'linens', '"', ',', '"', 'Outdoor', 'dining', 'area', '"', ',', '"', 'Hot', 'tub', '"', ',', '"', 'Dining', 'table', '"', ',', '"', 'Luggage', 'dropoff', 'allowed', '"', ',', '"', 'Oven', '"', ',', '"', 'Wifi', '"', ',', '"', 'Free', 'street', 'parking', '"', ',', '"', 'Microwave', '"', ',', '"', 'Free', 'parking', 'on', 'premises', '"', ',', '"', 'Laundromat', 'nearby', '"', ',', '"', 'Drying', 'rack', 'for', 'clothing', '"', ',', '"', 'Kitchen', '"', ',', '"', 'Heating', '"', ',', '"', 'Freezer', '"', ',', '"', 'Coffee', 'maker', '"', ',', '"', 'Patio', 'or', 'balcony', '"', ',', '"', 'Washer', '"', ',']
from collections import Counter
words_NY = [token.lemma_.lower() for token in doc_NY if not token.is_stop and not token.is_punct and not token.is_space]
len(words_NY)
87591
words_Rome = [token.lemma_.lower() for token in doc_Rome if not token.is_stop and not token.is_punct and not token.is_space]
len(words_Rome)
88840
# Compute the frequency of each word
word_freq_NY = Counter(words_NY)
# Select the 10 most frequent words
most_freq_words_NY = word_freq_NY.most_common(10)
for word in most_freq_words_NY:
    print(word)
('alarm', 2955)
('wifi', 1961)
('dryer', 1843)
('kitchen', 1719)
('hot', 1638)
('smoke', 1597)
('water', 1562)
('air', 1536)
('heating', 1470)
('allow', 1456)
# Compute the frequency of each word
word_freq_Rome = Counter(words_Rome)
# Select the 10 most frequent words
most_freq_words_Rome = word_freq_Rome.most_common(10)
for word in most_freq_words_Rome:
    print(word)
('hot', 1839)
('water', 1802)
('allow', 1624)
('parking', 1378)
('dryer', 1365)
('wifi', 1362)
('coffee', 1335)
('heating', 1322)
('alarm', 1318)
('essential', 1233)
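The token-level counts above split multi-word amenities, so "Hot water" contributes to both "hot" and "water", and "Smoke alarm" to both "smoke" and "alarm". Counting whole amenity strings avoids this. A minimal sketch on a toy sample (in the notebook one would iterate over `df_NY["amenities"].values` instead):

```python
import json
from collections import Counter

# Toy sample in the same shape as the "amenities" column (JSON-encoded lists)
sample = [
    '["Hot water", "Wifi", "Smoke alarm"]',
    '["Wifi", "Smoke alarm", "Hot water", "Kitchen"]',
]

# Count complete amenity strings instead of individual tokens
amenity_freq = Counter()
for row in sample:
    amenity_freq.update(json.loads(row))

print(amenity_freq.most_common(3))
```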
import pandas as pd
tokens_NY = pd.DataFrame({"Token": [token.text for token in doc_NY],
"Lemma": [token.lemma_ for token in doc_NY],
"POS": [token.pos_ for token in doc_NY],
"Tag": [token.tag_ for token in doc_NY],
"Dep": [token.dep_ for token in doc_NY]})
tokens_NY[tokens_NY.POS == "NOUN"]
| | Token | Lemma | POS | Tag | Dep |
|---|---|---|---|---|---|
| 2 | workspace | workspace | NOUN | NN | nmod |
| 6 | Essentials | essential | NOUN | NNS | nmod |
| 11 | basics | basic | NOUN | NNS | appos |
| 23 | Dishes | dish | NOUN | NNS | conj |
| 25 | silverware | silverware | NOUN | NN | conj |
| ... | ... | ... | ... | ... | ... |
| 232724 | maker | maker | NOUN | NN | appos |
| 232729 | parking | parking | NOUN | NN | compound |
| 232731 | premises | premise | NOUN | NNS | appos |
| 232736 | products | product | NOUN | NNS | appos |
| 232742 | unit | unit | NOUN | NN | appos |
56868 rows × 5 columns
import pandas as pd
tokens_Rome = pd.DataFrame({"Token": [token.text for token in doc_Rome],
"Lemma": [token.lemma_ for token in doc_Rome],
"POS": [token.pos_ for token in doc_Rome],
"Tag": [token.tag_ for token in doc_Rome],
"Dep": [token.dep_ for token in doc_Rome]})
tokens_Rome[tokens_Rome.POS == "NOUN"]
| | Token | Lemma | POS | Tag | Dep |
|---|---|---|---|---|---|
| 2 | products | product | NOUN | NNS | dep |
| 10 | Bed | bed | NOUN | NN | compound |
| 11 | linens | linen | NOUN | NNS | conj |
| 16 | dining | dining | NOUN | NN | compound |
| 17 | area | area | NOUN | NN | appos |
| ... | ... | ... | ... | ... | ... |
| 230991 | water | water | NOUN | NN | compound |
| 230992 | kettle | kettle | NOUN | NN | appos |
| 230996 | Hair | hair | NOUN | NN | compound |
| 230997 | dryer | dryer | NOUN | NN | appos |
| 231002 | aid | aid | NOUN | NN | appos |
57915 rows × 5 columns
# Compute the frequency of each noun
nouns_NY = tokens_NY[tokens_NY.POS == "NOUN"].Token
word_freq_NY = Counter(nouns_NY)
# Select the 10 most frequent nouns
most_freq_nouns_NY = word_freq_NY.most_common(10)
for word in most_freq_nouns_NY:
    print(word)
('alarm', 2955)
('Wifi', 1836)
('Smoke', 1597)
('water', 1562)
('parking', 1383)
('Heating', 1378)
('dryer', 1367)
('monoxide', 1359)
('TV', 1289)
('Air', 1256)
# Compute the frequency of each noun
nouns_Rome = tokens_Rome[tokens_Rome.POS == "NOUN"].Token
word_freq_Rome = Counter(nouns_Rome)
# Select the 10 most frequent nouns
most_freq_nouns_Rome = word_freq_Rome.most_common(10)
for word in most_freq_nouns_Rome:
    print(word)
('water', 1801)
('parking', 1378)
('alarm', 1318)
('dryer', 1291)
('Hair', 1228)
('Essentials', 1227)
('Heating', 1157)
('Wifi', 1153)
('Dishes', 1056)
('Refrigerator', 1050)
Airbnb amenities New York vs. Airbnb amenities Rome - Top 10 most frequent nouns¶
| NYC word | NYC frequency | Rome word | Rome frequency |
|---|---|---|---|
| alarm | 2955 | water | 1801 |
| Wifi | 1836 | parking | 1378 |
| Smoke | 1597 | alarm | 1318 |
| water | 1562 | dryer | 1291 |
| parking | 1383 | Hair | 1228 |
| Heating | 1378 | Essentials | 1227 |
| dryer | 1367 | Heating | 1157 |
| monoxide | 1359 | Wifi | 1153 |
| TV | 1289 | Dishes | 1056 |
| Air | 1256 | Refrigerator | 1050 |
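The noun counts are built from the raw token text, so case variants such as "Wifi"/"wifi" or "Heating"/"heating" are counted separately (which is why "Wifi" appears here with 1836 while the lemma count above showed "wifi" with 1961). Lowercasing the lemma merges them; a toy sketch, with the notebook equivalent noted as a hypothetical variant:

```python
from collections import Counter

# Toy noun list standing in for tokens_NY[tokens_NY.POS == "NOUN"]
nouns = ["Wifi", "wifi", "Heating", "heating", "alarm"]

# Lowercasing merges case variants before counting
freq = Counter(n.lower() for n in nouns)
print(freq.most_common(2))

# In the notebook (hypothetical):
# Counter(tokens_NY[tokens_NY.POS == "NOUN"].Lemma.str.lower())
```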